Node in NotReady state
Background
Corporation XYZ's DevOps team has deployed a new node group, and the application team has deployed a new application outside of the retail-app: a Deployment (prod-app) and its supporting DaemonSet (prod-ds).
Shortly after these workloads were deployed, the monitoring team reported that the node is transitioning to a NotReady state. The root cause isn't immediately apparent, and as the DevOps on-call engineer, you need to investigate why the node is becoming unresponsive and implement a solution to restore normal operation.
Step 1: Verify Node Status
Let's first check the node's status to confirm it is currently NotReady:
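One way to do this, using the $NODE_NAME variable described in the note below (the lab's exact command may differ):
$ kubectl get node $NODE_NAME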
NAME STATUS ROLES AGE VERSION
ip-10-42-180-244.us-west-2.compute.internal NotReady <none> 15m v1.27.1-eks-2f008fe
Note: For your convenience, we have added the node name as the environment variable $NODE_NAME.
Step 2: Check System Pod Status
Let's examine the status of kube-system pods on the affected node to identify any system-level issues:
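One way to do this is to filter kube-system pods by the affected node's name (a sketch; the lab's exact command may differ):
$ kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=$NODE_NAME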
This command lists all kube-system pods running on the affected node, helping us rule out issues caused by system components. Note that all of these pods are in the Running state.
Step 3: Examine Node Conditions
Let's examine the node's describe output to understand the cause of the NotReady state.
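The relevant sections of the describe output are shown below; it can be produced with:
$ kubectl describe node $NODE_NAME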
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 3m18s kube-proxy
Normal Starting 3m31s kubelet Starting kubelet.
Warning InvalidDiskCapacity 3m31s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 3m31s (x2 over 3m31s) kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 3m31s (x2 over 3m31s) kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 3m31s (x2 over 3m31s) kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 3m31s kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 3m27s node-controller Node ip-10-42-180-244.us-west-2.compute.internal event: Registered Node ip-10-42-180-244.us-west-2.compute.internal in Controller
Normal Synced 3m27s cloud-node-controller Node synced successfully
Normal ControllerVersionNotice 3m12s vpc-resource-controller The node is managed by VPC resource controller version v1.6.3
Normal NodeReady 3m10s kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeReady
Normal NodeTrunkInitiated 3m8s vpc-resource-controller The node has trunk interface initialized successfully
Warning SystemOOM 94s kubelet System OOM encountered, victim process: python, pid: 4763
Normal NodeNotReady 52s node-controller Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeNotReady
Here we see that the node's conditions are all Unknown: the kubelet has stopped posting status updates and cannot be reached. You can read more about node conditions in the Kubernetes documentation.
The node has the following taints:
- node.kubernetes.io/unreachable:NoExecute: Indicates pods will be evicted if they don't tolerate this taint
- node.kubernetes.io/unreachable:NoSchedule: Prevents new pods from being scheduled
The node conditions show that the kubelet has stopped posting status updates, which typically indicates severe resource constraints or system instability.
Step 4: Analyze Resource Usage
Let's examine the resource utilization of our workloads with kubectl top.
The metrics-server has already been installed in your cluster to provide resource usage data.
4.1. First, check node-level metrics
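With the Metrics Server in place, node metrics can be pulled with:
$ kubectl top nodes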
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-42-142-116.us-west-2.compute.internal 34m 1% 940Mi 13%
ip-10-42-185-41.us-west-2.compute.internal 27m 1% 1071Mi 15%
ip-10-42-96-176.us-west-2.compute.internal 175m 9% 2270Mi 32%
ip-10-42-180-244.us-west-2.compute.internal <unknown> <unknown> <unknown> <unknown>
4.2. Next, attempt to check pod metrics
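And pod metrics for the prod namespace with:
$ kubectl top pods -n prod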
error: Metrics not available for pod prod/prod-app-xx-xx, age: 17m14.466020856s
We can observe that:
- The troubled node shows unknown for all metrics
- Other nodes are operating normally with moderate resource usage
- Pod metrics in the prod namespace are unavailable
Step 5: CloudWatch Metrics Investigation
Since Metrics Server isn't providing data, let's use CloudWatch to check EC2 instance metrics:
For your convenience, the instance ID of the worker node in newnodegroup_3 has been stored as an environment variable $INSTANCEID.
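One possible query matching the output below, assuming 5-minute average samples of the EC2 CPUUtilization metric (the lab's exact parameters may differ):
$ aws cloudwatch get-metric-data \
    --start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --metric-data-queries '[{"Id":"cpu","MetricStat":{"Metric":{"Namespace":"AWS/EC2","MetricName":"CPUUtilization","Dimensions":[{"Name":"InstanceId","Value":"'"$INSTANCEID"'"}]},"Period":300,"Stat":"Average"}}]'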
{
    "MetricDataResults": [
        {
            "Id": "cpu",
            "Label": "CPUUtilization",
            "Timestamps": [
                "2025-02-12T16:25:00+00:00",
                "2025-02-12T16:20:00+00:00",
                "2025-02-12T16:15:00+00:00",
                "2025-02-12T16:10:00+00:00"
            ],
            "Values": [
                99.87333333333333,
                99.89633636636336,
                99.86166666666668,
                62.67880324995537
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}
The CloudWatch metrics reveal:
- CPU utilization pegged above 99% in the three most recent samples
- A sharp jump from roughly 63% to over 99% between consecutive five-minute data points
- A clear indication of CPU exhaustion on the node
Step 6: Mitigate Impact
Let's check deployment details and implement immediate changes to stabilize the node:
6.1. Check the deployment resource configurations
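One way to surface requests and limits per pod, assuming single-container pods (a sketch; the lab's exact command may differ):
$ kubectl get pods -n prod \
    -o custom-columns='NAME:.metadata.name,CPU_REQUEST:.spec.containers[0].resources.requests.cpu,MEM_REQUEST:.spec.containers[0].resources.requests.memory,CPU_LIMIT:.spec.containers[0].resources.limits.cpu,MEM_LIMIT:.spec.containers[0].resources.limits.memory'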
NAME CPU_REQUEST MEM_REQUEST CPU_LIMIT MEM_LIMIT
prod-app-74b97f9d85-k6c84 100m 64Mi <none> <none>
prod-app-74b97f9d85-mpcrv 100m 64Mi <none> <none>
prod-app-74b97f9d85-wdqlr 100m 64Mi <none> <none>
...
...
prod-ds-558sx 100m 128Mi <none> <none>
Notice that neither the deployment nor the DaemonSet has resource limits configured, which allowed unconstrained resource consumption.
6.2. Let's scale down the deployment and stop the resource overload
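Scaling the Deployment down stops the runaway pods immediately; a minimal sketch (scaling to zero is an assumption here, the lab may use a different replica count):
$ kubectl scale deployment prod-app -n prod --replicas=0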
6.3. Recycle the node on the nodegroup
This can take up to 1 minute. The script will store the new node name as NODE_NAME_2.
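For reference, recycling roughly amounts to cordoning the old node and terminating its instance so the managed node group brings up a replacement; a rough sketch of the idea, not the lab's actual script:
$ kubectl cordon $NODE_NAME
$ aws ec2 terminate-instances --instance-ids $INSTANCEID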
6.4. Verify node status
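Using the new node name stored by the script:
$ kubectl get node $NODE_NAME_2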
NAME STATUS ROLES AGE VERSION
ip-10-42-180-24.us-west-2.compute.internal Ready <none> 0h43m v1.30.8-eks-aeac579
Step 7: Implement Long-term Solutions
The Dev team has identified and fixed a memory leak in the application. Let's implement the fix and establish proper resource management:
7.1. Apply the updated application configuration
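A sketch of this step; the manifest name prod-app-fixed.yaml below is hypothetical and stands in for whatever the Dev team provides:
$ kubectl apply -n prod -f prod-app-fixed.yaml    # hypothetical manifest name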
7.2. Set resource limits for the deployment (cpu: 500m, memory: 512Mi)
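This can be done in place with kubectl set resources (assuming only limits are being added, as stated above):
$ kubectl set resources deployment prod-app -n prod --limits=cpu=500m,memory=512Mi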
7.3. Set resource limits for the DaemonSet (cpu: 500m, memory: 512Mi)
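And likewise for the DaemonSet:
$ kubectl set resources daemonset prod-ds -n prod --limits=cpu=500m,memory=512Mi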
7.4. Perform rolling updates and scale the deployment back to its desired replica count
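A sketch, assuming the desired replica count is six (six prod-app pods appear in the verification output in Step 8):
$ kubectl scale deployment prod-app -n prod --replicas=6
$ kubectl rollout status deployment prod-app -n prod
$ kubectl rollout status daemonset prod-ds -n prod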
Step 8: Verification
Let's verify our fixes have resolved the issues:
8.1. Check that the pods have been recreated
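For example (exact commands in this step may vary):
$ kubectl get pods -n prod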
NAME READY STATUS RESTARTS AGE
prod-app-666f8f7bd5-658d6 1/1 Running 0 1m
prod-app-666f8f7bd5-6jrj4 1/1 Running 0 1m
prod-app-666f8f7bd5-9rf6m 1/1 Running 0 1m
prod-app-666f8f7bd5-pm545 1/1 Running 0 1m
prod-app-666f8f7bd5-ttkgs 1/1 Running 0 1m
prod-app-666f8f7bd5-zm8lx 1/1 Running 0 1m
prod-ds-ll4lv 1/1 Running 0 1m
8.2. Verify pod resource usage
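Again with kubectl top:
$ kubectl top pods -n prod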
NAME CPU(cores) MEMORY(bytes)
prod-app-666f8f7bd5-658d6 215m 425Mi
prod-app-666f8f7bd5-6jrj4 203m 426Mi
prod-app-666f8f7bd5-9rf6m 203m 426Mi
prod-app-666f8f7bd5-pm545 205m 425Mi
prod-app-666f8f7bd5-ttkgs 248m 425Mi
prod-app-666f8f7bd5-zm8lx 215m 425Mi
prod-ds-ll4lv 586m 3Mi
8.3. Check node status
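Check the recycled node:
$ kubectl get node $NODE_NAME_2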
NAME STATUS ROLES AGE VERSION
ip-10-42-180-24.us-west-2.compute.internal Ready <none> 1h35m v1.30.8-eks-aeac579
8.4. Check node resource usage
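And its node-level metrics:
$ kubectl top node $NODE_NAME_2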
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-42-180-24.us-west-2.compute.internal 1612m 83% 3145Mi 44%
Key Takeaways
1. Resource Management
- Always set appropriate resource requests and limits
- Monitor cumulative workload impact
- Implement proper resource quotas
2. Monitoring
- Use multiple monitoring tools
- Set up proactive alerting
- Monitor both container and node-level metrics
3. Best Practices
- Implement horizontal pod autoscaling
- Use cluster/node autoscaling: Cluster Autoscaler, Karpenter, or EKS Auto Mode
- Perform regular capacity planning
- Implement proper error handling in applications