
Node in Not-Ready state

Background

Corporation XYZ's DevOps team has deployed a new node group, and the application team has deployed a new application outside of the retail-app, consisting of a Deployment (prod-app) and a supporting DaemonSet (prod-ds).

After these workloads were deployed, the monitoring team reported that the new node keeps transitioning to a NotReady state. The root cause isn't immediately apparent, and as the DevOps on-call engineer, you need to investigate why the node is becoming unresponsive and implement a solution to restore normal operation.

Step 1: Verify Node Status

Let's first verify the node's status to confirm the current state:

~$kubectl get nodes --selector=eks.amazonaws.com/nodegroup=new_nodegroup_3
NAME                                          STATUS     ROLES    AGE     VERSION
ip-10-42-180-244.us-west-2.compute.internal   NotReady   <none>   15m     v1.27.1-eks-2f008fe
info

Note: For your convenience, we have added the node name as the environment variable $NODE_NAME.
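If you are following along in your own cluster, a minimal sketch for setting this variable yourself (assuming the node group label above matches exactly one node) would be:

~$NODE_NAME=$(kubectl get nodes --selector=eks.amazonaws.com/nodegroup=new_nodegroup_3 -o jsonpath='{.items[0].metadata.name}')
~$echo $NODE_NAME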

Step 2: Check System Pod Status

Let's examine the status of kube-system pods on the affected node to identify any system-level issues:

~$kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=$NODE_NAME

This command shows all kube-system pods running on the affected node, helping us identify any system-level issues that could be affecting the node. You should see that all of these pods are in the Running state.
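As an optional cross-check, you could list only the pods on the node that are not in the Running phase; an empty result supports the conclusion that the system pods themselves are healthy (a quick sketch using a standard field selector):

~$kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=$NODE_NAME,status.phase!=Running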

Step 3: Examine Node Conditions

Let's examine the node's describe output to understand the cause of the NotReady state.

~$kubectl describe node $NODE_NAME | sed -n '/^Taints:/,/^[A-Z]/p;/^Conditions:/,/^[A-Z]/p;/^Events:/,$p'
 
 
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 12 Feb 2025 15:20:21 +0000   Wed, 12 Feb 2025 15:21:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 12 Feb 2025 15:20:21 +0000   Wed, 12 Feb 2025 15:21:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 12 Feb 2025 15:20:21 +0000   Wed, 12 Feb 2025 15:21:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 12 Feb 2025 15:20:21 +0000   Wed, 12 Feb 2025 15:21:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
Events:
  Type     Reason                   Age                    From                     Message
  ----     ------                   ----                   ----                     -------
  Normal   Starting                 3m18s                  kube-proxy
  Normal   Starting                 3m31s                  kubelet                  Starting kubelet.
  Warning  InvalidDiskCapacity      3m31s                  kubelet                  invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  3m31s (x2 over 3m31s)  kubelet                  Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    3m31s (x2 over 3m31s)  kubelet                  Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     3m31s (x2 over 3m31s)  kubelet                  Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  3m31s                  kubelet                  Updated Node Allocatable limit across pods
  Normal   RegisteredNode           3m27s                  node-controller          Node ip-10-42-180-244.us-west-2.compute.internal event: Registered Node ip-10-42-180-244.us-west-2.compute.internal in Controller
  Normal   Synced                   3m27s                  cloud-node-controller    Node synced successfully
  Normal   ControllerVersionNotice  3m12s                  vpc-resource-controller  The node is managed by VPC resource controller version v1.6.3
  Normal   NodeReady                3m10s                  kubelet                  Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeReady
  Normal   NodeTrunkInitiated       3m8s                   vpc-resource-controller  The node has trunk interface initialized successfully
  Warning  SystemOOM                94s                    kubelet                  System OOM encountered, victim process: python, pid: 4763
  Normal   NodeNotReady             52s                    node-controller          Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeNotReady

Here we see that the node's conditions are reported as Unknown because the kubelet has stopped posting status updates and the node can no longer be reached. The events also show a SystemOOM shortly before the node went NotReady, with a python process as the OOM victim. You can read more about node status in the Kubernetes documentation.

Node Status Information

The node has the following taints:

  • node.kubernetes.io/unreachable:NoExecute: Indicates pods will be evicted if they don't tolerate this taint
  • node.kubernetes.io/unreachable:NoSchedule: Prevents new pods from being scheduled

The node conditions show that the kubelet has stopped posting status updates, which typically indicates severe resource constraints or system instability.
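If you prefer a narrower view than the full describe output, a quick sketch that prints just the taint keys and the Ready condition status is:

~$kubectl get node $NODE_NAME -o jsonpath='{.spec.taints[*].key}{"\n"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}'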

Step 4: Analyzing Resource Usage

Let's examine the resource utilization of our workloads using a monitoring tool.

info

The metrics-server has already been installed in your cluster to provide resource usage data.

4.1. First, check node-level metrics

~$kubectl top nodes
NAME                                          CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
ip-10-42-142-116.us-west-2.compute.internal   34m          1%          940Mi           13%
ip-10-42-185-41.us-west-2.compute.internal    27m          1%          1071Mi          15%
ip-10-42-96-176.us-west-2.compute.internal    175m         9%          2270Mi          32%
ip-10-42-180-244.us-west-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>

4.2. Next, attempt to check pod metrics

~$kubectl top pods -n prod
error: Metrics not available for pod prod/prod-app-xx-xx, age: 17m14.466020856s
note

We can observe that:

  • The troubled node shows unknown for all metrics
  • Other nodes are operating normally with moderate resource usage
  • Pod metrics in the prod namespace are unavailable
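Before concluding that the node is the problem, it can be worth confirming that metrics-server itself is healthy; a quick sketch (assuming the common k8s-app=metrics-server label and the v1beta1.metrics.k8s.io APIService used by standard installations):

~$kubectl get pods -n kube-system -l k8s-app=metrics-server
~$kubectl get apiservice v1beta1.metrics.k8s.io

If the APIService reports Available=True and the metrics-server pods are Running, the unknown values point at the node rather than at the monitoring stack.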

Step 5: CloudWatch Metrics Investigation

Since Metrics Server isn't providing data, let's use CloudWatch to check EC2 instance metrics:

info

For your convenience, the instance ID of the worker node in new_nodegroup_3 has been stored as the environment variable $INSTANCE_ID.
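If you need to derive the instance ID yourself, one option is to read it from the node's providerID, which on EKS has the form aws:///<availability-zone>/<instance-id> (a sketch, assuming that format):

~$INSTANCE_ID=$(kubectl get node $NODE_NAME -o jsonpath='{.spec.providerID}' | awk -F/ '{print $NF}')
~$echo $INSTANCE_ID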

~$aws cloudwatch get-metric-data --region us-west-2 --start-time $(date -u -d '1 hour ago' +"%Y-%m-%dT%H:%M:%SZ") --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") --metric-data-queries '[{"Id":"cpu","MetricStat":{"Metric":{"Namespace":"AWS/EC2","MetricName":"CPUUtilization","Dimensions":[{"Name":"InstanceId","Value":"'$INSTANCE_ID'"}]},"Period":60,"Stat":"Average"}}]'
 
{
    "MetricDataResults": [
        {
            "Id": "cpu",
            "Label": "CPUUtilization",
            "Timestamps": [
                "2025-02-12T16:25:00+00:00",
                "2025-02-12T16:20:00+00:00",
                "2025-02-12T16:15:00+00:00",
                "2025-02-12T16:10:00+00:00"
            ],
            "Values": [
                99.87333333333333,
                99.89633636636336,
                99.86166666666668,
                62.67880324995537
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}
info

The CloudWatch metrics reveal:

  • CPU utilization consistently above 99%
  • A sharp jump from roughly 63% to over 99% utilization within the last hour
  • Clear indication of resource exhaustion
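For additional evidence from outside the cluster, the same get-metric-data pattern can be pointed at the AWS/EC2 StatusCheckFailed metric; a non-zero maximum would confirm the instance is failing its health checks (an optional sketch):

~$aws cloudwatch get-metric-data --region us-west-2 --start-time $(date -u -d '1 hour ago' +"%Y-%m-%dT%H:%M:%SZ") --end-time $(date -u +"%Y-%m-%dT%H:%M:%SZ") --metric-data-queries '[{"Id":"status","MetricStat":{"Metric":{"Namespace":"AWS/EC2","MetricName":"StatusCheckFailed","Dimensions":[{"Name":"InstanceId","Value":"'$INSTANCE_ID'"}]},"Period":60,"Stat":"Maximum"}}]'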

Step 6: Mitigate Impact

Let's check deployment details and implement immediate changes to stabilize the node:

6.1. Check the deployment resource configurations

~$kubectl get pods -n prod -o custom-columns="NAME:.metadata.name,CPU_REQUEST:.spec.containers[*].resources.requests.cpu,MEM_REQUEST:.spec.containers[*].resources.requests.memory,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,MEM_LIMIT:.spec.containers[*].resources.limits.memory"
NAME                        CPU_REQUEST   MEM_REQUEST   CPU_LIMIT   MEM_LIMIT
prod-app-74b97f9d85-k6c84   100m          64Mi          <none>      <none>
prod-app-74b97f9d85-mpcrv   100m          64Mi          <none>      <none>
prod-app-74b97f9d85-wdqlr   100m          64Mi          <none>      <none>
...
...
prod-ds-558sx               100m          128Mi         <none>      <none>
info

Notice that neither the deployment nor the DaemonSet has resource limits configured, which allowed unconstrained resource consumption.
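One way to guard against this class of problem is a namespace-level LimitRange that injects default requests and limits whenever a container omits them. A minimal sketch (the name prod-defaults and the values are illustrative, not part of the workshop environment):

~$kubectl apply -n prod -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: prod-defaults
spec:
  limits:
    - type: Container
      default:            # limits applied when a container does not set its own
        cpu: 500m
        memory: 512Mi
      defaultRequest:     # requests applied when a container does not set its own
        cpu: 250m
        memory: 256Mi
EOF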

6.2. Let's scale down the deployment and stop the resource overload

~$kubectl scale deployment/prod-app -n prod --replicas=0 && kubectl delete pod -n prod -l app=prod-app --force --grace-period=0 && kubectl wait --for=delete pod -n prod -l app=prod-app

6.3. Recycle the node on the nodegroup

~$aws eks update-nodegroup-config --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_3 --scaling-config desiredSize=0 && aws eks wait nodegroup-active --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_3 && aws eks update-nodegroup-config --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_3 --scaling-config desiredSize=1 && aws eks wait nodegroup-active --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_3 && for i in {1..6}; do NODE_NAME_2=$(kubectl get nodes --selector eks.amazonaws.com/nodegroup=new_nodegroup_3 -o jsonpath='{.items[0].metadata.name}' 2>/dev/null) && [ -n "$NODE_NAME_2" ] && break || sleep 5; done && [ -n "$NODE_NAME_2" ]
info

This can take up to a minute. The script will store the new node name in the environment variable $NODE_NAME_2.
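In a less urgent situation you might first cordon and drain the old node so remaining pods are evicted gracefully before the instance is replaced; a sketch of that alternative (note that a node whose kubelet is unreachable may not complete the drain cleanly):

~$kubectl cordon $NODE_NAME
~$kubectl drain $NODE_NAME --ignore-daemonsets --delete-emptydir-data --force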

6.4. Verify node status

~$kubectl get nodes --selector=kubernetes.io/hostname=$NODE_NAME_2
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-42-180-24.us-west-2.compute.internal    Ready    <none>   0h43m   v1.30.8-eks-aeac579

Step 7: Implementing Long-term Solutions

The Dev team has identified and fixed a memory leak in the application. Let's implement the fix and establish proper resource management:

7.1. Apply the updated application configuration

~$kubectl apply -f /home/ec2-user/environment/eks-workshop/modules/troubleshooting/workernodes/yaml/configmaps-new.yaml

7.2. Set resource limits for the deployment (cpu: 500m, memory: 512Mi)

~$kubectl patch deployment prod-app -n prod --patch '{"spec":{"template":{"spec":{"containers":[{"name":"prod-app","resources":{"limits":{"cpu":"500m","memory":"512Mi"},"requests":{"cpu":"250m","memory":"256Mi"}}}]}}}}'
 

7.3. Set resource limits for the DaemonSet (cpu: 500m, memory: 512Mi)

~$kubectl patch daemonset prod-ds -n prod --patch '{"spec":{"template":{"spec":{"containers":[{"name":"prod-ds","resources":{"limits":{"cpu":"500m","memory":"512Mi"},"requests":{"cpu":"250m","memory":"256Mi"}}}]}}}}'
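To confirm both patches landed as expected, you could print the resources block from each pod template (a quick sketch):

~$kubectl get deployment prod-app -n prod -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
~$kubectl get daemonset prod-ds -n prod -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'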

7.4. Perform rolling updates and scale back to desired state

~$kubectl rollout restart deployment/prod-app -n prod && kubectl rollout restart daemonset/prod-ds -n prod && kubectl scale deployment prod-app -n prod --replicas=6
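If you want to block until the new pods are fully rolled out before verifying, rollout status can be used (the timeout value is arbitrary):

~$kubectl rollout status deployment/prod-app -n prod --timeout=120s
~$kubectl rollout status daemonset/prod-ds -n prod --timeout=120s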

Step 8: Verification

Let's verify our fixes have resolved the issues:

8.1. Check pod creations

~$kubectl get pods -n prod
NAME                        READY   STATUS    RESTARTS   AGE
prod-app-666f8f7bd5-658d6   1/1     Running   0          1m
prod-app-666f8f7bd5-6jrj4   1/1     Running   0          1m
prod-app-666f8f7bd5-9rf6m   1/1     Running   0          1m
prod-app-666f8f7bd5-pm545   1/1     Running   0          1m
prod-app-666f8f7bd5-ttkgs   1/1     Running   0          1m
prod-app-666f8f7bd5-zm8lx   1/1     Running   0          1m
prod-ds-ll4lv               1/1     Running   0          1m
 

8.2. Verify pod resource usage

~$kubectl top pods -n prod
NAME                        CPU(cores)   MEMORY(bytes)
prod-app-666f8f7bd5-658d6   215m         425Mi
prod-app-666f8f7bd5-6jrj4   203m         426Mi
prod-app-666f8f7bd5-9rf6m   203m         426Mi
prod-app-666f8f7bd5-pm545   205m         425Mi
prod-app-666f8f7bd5-ttkgs   248m         425Mi
prod-app-666f8f7bd5-zm8lx   215m         425Mi
prod-ds-ll4lv               586m         3Mi

8.3. Check node status

~$kubectl get node --selector=kubernetes.io/hostname=$NODE_NAME_2
NAME                                          STATUS   ROLES    AGE     VERSION
ip-10-42-180-24.us-west-2.compute.internal    Ready    <none>   1h35m   v1.30.8-eks-aeac579

8.4. Check node resource usage

~$kubectl top node --selector=kubernetes.io/hostname=$NODE_NAME_2
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-42-180-24.us-west-2.compute.internal    1612m        83%    3145Mi          44%

Key Takeaways

1. Resource Management

  • Always set appropriate resource requests and limits
  • Monitor cumulative workload impact
  • Implement proper resource quotas (see the sketch after this list)
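As an example of the quota idea above, a namespace-level ResourceQuota can cap the total CPU and memory the prod namespace may consume; a minimal sketch (the name prod-quota and the numbers are illustrative):

~$kubectl apply -n prod -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
spec:
  hard:
    requests.cpu: "2"      # total CPU requests allowed in the namespace
    requests.memory: 4Gi   # total memory requests allowed in the namespace
    limits.cpu: "4"        # total CPU limits allowed in the namespace
    limits.memory: 8Gi     # total memory limits allowed in the namespace
EOF

Keep in mind that once a compute ResourceQuota is in place, pods that omit requests or limits are rejected, which pairs well with a LimitRange that supplies defaults.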

2. Monitoring

  • Use multiple monitoring tools
  • Set up proactive alerting
  • Monitor both container and node-level metrics

3. Best Practices

Additional Resources