Node in NotReady state
Background
Corporation XYZ's DevOps team has deployed a new node group, and the application team has deployed a new application outside of the retail-app: a Deployment (prod-app) and its supporting DaemonSet (prod-ds).
Shortly after these workloads were deployed, the monitoring team reported that the node is transitioning to a NotReady state. The root cause isn't immediately apparent, and as the DevOps on-call engineer, you need to investigate why the node is becoming unresponsive and implement a solution to restore normal operation.
Step 1: Verify Node Status
Let's first check the node's status to confirm it is currently NotReady:
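One way to do this, using the $NODE_NAME variable described in the note below (the lab's exact command may differ):
$ kubectl get node $NODE_NAME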
NAME STATUS ROLES AGE VERSION
ip-10-42-180-244.us-west-2.compute.internal NotReady <none> 15m v1.27.1-eks-2f008fe
Note: For your convenience, we have added the node name as the environment variable $NODE_NAME.
Step 2: Check System Pod Status
Let's examine the status of kube-system pods on the affected node to identify any system-level issues:
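One way to do this is to filter kube-system pods by the affected node's name (a sketch; the lab's exact command may differ):
$ kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=$NODE_NAME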
This command lists all kube-system pods running on the affected node, helping us rule out issues caused by system components. Note that all of these pods are in the Running state.
Step 3: Examine Node Conditions
Let's examine the node's describe output to understand the cause of the NotReady state.
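The relevant sections of the describe output are shown below; it can be produced with:
$ kubectl describe node $NODE_NAME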
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 12 Feb 2025 15:20:21 +0000 Wed, 12 Feb 2025 15:21:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 3m18s kube-proxy
Normal Starting 3m31s kubelet Starting kubelet.
Warning InvalidDiskCapacity 3m31s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 3m31s (x2 over 3m31s) kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 3m31s (x2 over 3m31s) kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 3m31s (x2 over 3m31s) kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 3m31s kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 3m27s node-controller Node ip-10-42-180-244.us-west-2.compute.internal event: Registered Node ip-10-42-180-244.us-west-2.compute.internal in Controller
Normal Synced 3m27s cloud-node-controller Node synced successfully
Normal ControllerVersionNotice 3m12s vpc-resource-controller The node is managed by VPC resource controller version v1.6.3
Normal NodeReady 3m10s kubelet Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeReady
Normal NodeTrunkInitiated 3m8s vpc-resource-controller The node has trunk interface initialized successfully
Warning SystemOOM 94s kubelet System OOM encountered, victim process: python, pid: 4763
Normal NodeNotReady 52s node-controller Node ip-10-42-180-244.us-west-2.compute.internal status is now: NodeNotReady
Here we see that the node's conditions are all Unknown: the kubelet has stopped posting status updates and cannot be reached. You can read more about node conditions in the Kubernetes documentation.
The node has the following taints:
- node.kubernetes.io/unreachable:NoExecute: Indicates pods will be evicted if they don't tolerate this taint
- node.kubernetes.io/unreachable:NoSchedule: Prevents new pods from being scheduled
The node conditions show that the kubelet has stopped posting status updates, which typically indicates severe resource constraints or system instability.
Step 4: Analyze Resource Usage
Let's examine the resource utilization of our workloads with kubectl top.
The metrics-server has already been installed in your cluster to provide resource usage data.
4.1. First, check node-level metrics
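With the Metrics Server in place, node metrics can be pulled with:
$ kubectl top nodes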
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-42-142-116.us-west-2.compute.internal 34m 1% 940Mi 13%
ip-10-42-185-41.us-west-2.compute.internal 27m 1% 1071Mi 15%
ip-10-42-96-176.us-west-2.compute.internal 175m 9% 2270Mi 32%
ip-10-42-180-244.us-west-2.compute.internal <unknown> <unknown> <unknown> <unknown>
4.2. Next, attempt to check pod metrics
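And pod metrics for the prod namespace with:
$ kubectl top pods -n prod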
error: Metrics not available for pod prod/prod-app-xx-xx, age: 17m14.466020856s
We can observe that:
- The troubled node shows unknown for all metrics
- Other nodes are operating normally with moderate resource usage
- Pod metrics in the prod namespace are unavailable
Step 5: CloudWatch Metrics Investigation
Since Metrics Server isn't providing data, let's use CloudWatch to check EC2 instance metrics:
For your convenience, the instance ID of the worker node in newnodegroup_3 has been stored as an environment variable $INSTANCEID.
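One possible query matching the output below, assuming 5-minute average samples of the EC2 CPUUtilization metric (the lab's exact parameters may differ):
$ aws cloudwatch get-metric-data \
    --start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --metric-data-queries '[{"Id":"cpu","MetricStat":{"Metric":{"Namespace":"AWS/EC2","MetricName":"CPUUtilization","Dimensions":[{"Name":"InstanceId","Value":"'"$INSTANCEID"'"}]},"Period":300,"Stat":"Average"}}]'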
{
    "MetricDataResults": [
        {
            "Id": "cpu",
            "Label": "CPUUtilization",
            "Timestamps": [
                "2025-02-12T16:25:00+00:00",
                "2025-02-12T16:20:00+00:00",
                "2025-02-12T16:15:00+00:00",
                "2025-02-12T16:10:00+00:00"
            ],
            "Values": [
                99.87333333333333,
                99.89633636636336,
                99.86166666666668,
                62.67880324995537
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}
The CloudWatch metrics reveal:
- CPU utilization pegged above 99% in the three most recent samples
- A sharp jump from roughly 63% to over 99% between consecutive five-minute data points
- A clear indication of CPU exhaustion on the node
Step 6: Mitigate Impact
Let's check deployment details and implement immediate changes to stabilize the node:
6.1. Check the deployment resource configurations
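One way to surface requests and limits per pod, assuming single-container pods (a sketch; the lab's exact command may differ):
$ kubectl get pods -n prod \
    -o custom-columns='NAME:.metadata.name,CPU_REQUEST:.spec.containers[0].resources.requests.cpu,MEM_REQUEST:.spec.containers[0].resources.requests.memory,CPU_LIMIT:.spec.containers[0].resources.limits.cpu,MEM_LIMIT:.spec.containers[0].resources.limits.memory'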
NAME CPU_REQUEST MEM_REQUEST CPU_LIMIT MEM_LIMIT
prod-app-74b97f9d85-k6c84 100m 64Mi <none> <none>
prod-app-74b97f9d85-mpcrv 100m 64Mi <none> <none>
prod-app-74b97f9d85-wdqlr 100m 64Mi <none> <none>
...
...
prod-ds-558sx 100m 128Mi <none> <none>
Notice that neither the deployment nor the DaemonSet has resource limits configured, which allowed unconstrained resource consumption.
6.2. Let's scale down the deployment and stop the resource overload
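Scaling the Deployment down stops the runaway pods immediately; a minimal sketch (scaling to zero is an assumption here, the lab may use a different replica count):
$ kubectl scale deployment prod-app -n prod --replicas=0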
6.3. Recycle the node on the nodegroup
This can take up to 1 minute. The script will store the new node name as NODE_NAME_2.
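For reference, recycling roughly amounts to cordoning the old node and terminating its instance so the managed node group brings up a replacement; a rough sketch of the idea, not the lab's actual script:
$ kubectl cordon $NODE_NAME
$ aws ec2 terminate-instances --instance-ids $INSTANCEID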
6.4. Verify node status
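Using the new node name stored by the script:
$ kubectl get node $NODE_NAME_2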
NAME STATUS ROLES AGE VERSION
ip-10-42-180-24.us-west-2.compute.internal Ready <none> 0h43m v1.30.8-eks-aeac579
Step 7: Implement Long-term Solutions
The Dev team has identified and fixed a memory leak in the application. Let's implement the fix and establish proper resource management:
7.1. Apply the updated application configuration
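A sketch of this step; the manifest name prod-app-fixed.yaml below is hypothetical and stands in for whatever the Dev team provides:
$ kubectl apply -n prod -f prod-app-fixed.yaml    # hypothetical manifest name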
7.2. Set resource limits for the deployment (cpu: 500m, memory: 512Mi)
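This can be done in place with kubectl set resources (assuming only limits are being added, as stated above):
$ kubectl set resources deployment prod-app -n prod --limits=cpu=500m,memory=512Mi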
7.3. Set resource limits for the DaemonSet (cpu: 500m, memory: 512Mi)
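And likewise for the DaemonSet:
$ kubectl set resources daemonset prod-ds -n prod --limits=cpu=500m,memory=512Mi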
7.4. Perform rolling updates and scale the deployment back to its desired replica count
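A sketch, assuming the desired replica count is six (six prod-app pods appear in the verification output in Step 8):
$ kubectl scale deployment prod-app -n prod --replicas=6
$ kubectl rollout status deployment prod-app -n prod
$ kubectl rollout status daemonset prod-ds -n prod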
Step 8: Verification
Let's verify our fixes have resolved the issues:
8.1. Check that the pods have been recreated
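For example (exact commands in this step may vary):
$ kubectl get pods -n prod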
NAME READY STATUS RESTARTS AGE
prod-app-666f8f7bd5-658d6 1/1 Running 0 1m
prod-app-666f8f7bd5-6jrj4 1/1 Running 0 1m
prod-app-666f8f7bd5-9rf6m 1/1 Running 0 1m
prod-app-666f8f7bd5-pm545 1/1 Running 0 1m
prod-app-666f8f7bd5-ttkgs 1/1 Running 0 1m
prod-app-666f8f7bd5-zm8lx 1/1 Running 0 1m
prod-ds-ll4lv 1/1 Running 0 1m
8.2. Verify pod resource usage
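Again with kubectl top:
$ kubectl top pods -n prod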
NAME CPU(cores) MEMORY(bytes)
prod-app-666f8f7bd5-658d6 215m 425Mi
prod-app-666f8f7bd5-6jrj4 203m 426Mi
prod-app-666f8f7bd5-9rf6m 203m 426Mi
prod-app-666f8f7bd5-pm545 205m 425Mi
prod-app-666f8f7bd5-ttkgs 248m 425Mi
prod-app-666f8f7bd5-zm8lx 215m 425Mi
prod-ds-ll4lv 586m 3Mi
8.3. Check node status
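Check the recycled node:
$ kubectl get node $NODE_NAME_2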
NAME STATUS ROLES AGE VERSION
ip-10-42-180-24.us-west-2.compute.internal Ready <none> 1h35m v1.30.8-eks-aeac579
8.4. Check node resource usage
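And its node-level metrics:
$ kubectl top node $NODE_NAME_2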
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-42-180-24.us-west-2.compute.internal 1612m 83% 3145Mi 44%
Key Takeaways
1. Resource Management
- Always set appropriate resource requests and limits
- Monitor cumulative workload impact
- Implement proper resource quotas
2. Monitoring
- Use multiple monitoring tools
- Set up proactive alerting
- Monitor both container and node-level metrics
3. Best Practices
- Implement horizontal pod autoscaling
- Use cluster/node autoscaling: Cluster Autoscaler, Karpenter, or EKS Auto Mode
- Perform regular capacity planning
- Implement proper error handling in applications