Missing Worker Nodes
Background
Corporation XYZ is launching a new e-commerce platform in the us-west-2 region using an EKS cluster running Kubernetes version 1.30. During a security review, several gaps were identified in the cluster's security posture, particularly around node group volume encryption and AMI customization.
The security team provided specific requirements, including:
- Enabling encryption for node group volumes
- Setting up best practice network configurations
- Ensuring EKS Optimized AMIs are used
- Enabling Kubernetes auditing
Sam, an engineer with Kubernetes experience but new to EKS, created a new managed node group named new_nodegroup_1 to implement these requirements. However, no new nodes are joining the cluster despite the node group creation appearing successful. Initial checks of the EKS cluster status, node group configuration, and Kubernetes events haven't revealed any obvious issues.
Step 1: Verify Node Status
Let's first verify Sam's observation about the missing nodes:
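A quick check is to list nodes filtered by the label that EKS applies to managed node group instances (a minimal sketch; the workshop environment may use a slightly different command):

$ kubectl get nodes -l eks.amazonaws.com/nodegroup=new_nodegroup_1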
No resources found
This confirms Sam's observation: no nodes from the new node group (new_nodegroup_1) are present.
Step 2: Check Managed Node Group Status
Since Managed Node Groups are responsible for creating nodes, let's examine the nodegroup details; a command to retrieve them follows the list below. Key aspects to check:
- Node group existence
- Status and health
- Desired size
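You can retrieve all of this with the AWS CLI (the cluster name eks-workshop is taken from the output in Step 3):

$ aws eks describe-nodegroup \
    --cluster-name eks-workshop \
    --nodegroup-name new_nodegroup_1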
You can also view this information in the EKS Console.

Step 3: Analyze Node Group Health Status
The nodegroup should eventually transition to a DEGRADED state. Let's examine the detailed status:
Note: If the Workernodes workshop environment was deployed within the last 10 minutes, the node group may still appear in the ACTIVE state. If so, review the output below for reference; the node group should transition to DEGRADED within about 10 minutes of deployment. You can also proceed to Step 4 to check the Auto Scaling group directly.
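The detailed status comes from the same describe-nodegroup call used in Step 2; if you only want the health findings, a --query filter narrows the output (an optional convenience):

$ aws eks describe-nodegroup \
    --cluster-name eks-workshop \
    --nodegroup-name new_nodegroup_1 \
    --query 'nodegroup.health'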
{
  "nodegroup": {
    "nodegroupName": "new_nodegroup_1", <<<---
    "nodegroupArn": "arn:aws:eks:us-west-2:1234567890:nodegroup/eks-workshop/new_nodegroup_1/abcd1234-1234-abcd-1234-1234abcd1234",
    "clusterName": "eks-workshop",
    ...
    "status": "DEGRADED", <<<---
    "capacityType": "ON_DEMAND",
    "scalingConfig": {
      "minSize": 0,
      "maxSize": 1,
      "desiredSize": 1 <<<---
    },
    ...
    "resources": {
      "autoScalingGroups": [
        {
          "name": "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234"
        }
      ]
    },
    "health": { <<<---
      "issues": [
        {
          "code": "AsgInstanceLaunchFailures",
          "message": "Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state",
          "resourceIds": [
            "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234"
          ]
        }
      ]
    },
    ...
  }
}
The health status reveals a KMS key issue preventing instance launches. This aligns with Sam's attempt to implement volume encryption.
Step 4: Investigate Auto Scaling Group Activities
Let's examine the ASG activities to understand the launch failures:
Note: For your convenience, the Auto Scaling group name is available as the environment variable $NEW_NODEGROUP_1_ASG_NAME.
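For example, using that variable:

$ aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name $NEW_NODEGROUP_1_ASG_NAME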
{
  "Activities": [
    {
      "ActivityId": "1234abcd-1234-abcd-1234-1234abcd1234",
      "AutoScalingGroupName": "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234",
      "Description": "Launching a new EC2 instance: i-1234abcd1234abcd1. Status Reason: Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state",
      "Cause": "At 2024-10-04T18:06:36Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
      ...
      "StatusCode": "Cancelled",
--->>> "StatusMessage": "Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state"
    },
    ...
  ]
}
You can also view this information in the EKS Console. Click the Auto Scaling group name under the Details tab to view the scaling activities.

Step 5: Examine Launch Template Configuration
Let's check the Launch Template for encryption settings:
5.1. Find the Launch Template ID from the ASG or the managed node group. In this example, we will use the ASG, as shown below.
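One way is to read the template reference off the ASG. Depending on how the node group configured the ASG, the reference appears either directly under LaunchTemplate or inside MixedInstancesPolicy, so the query below returns both (one will be null):

$ aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names $NEW_NODEGROUP_1_ASG_NAME \
    --query 'AutoScalingGroups[0].[LaunchTemplate, MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification]'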
5.2. Now we can check the encryption settings
Note: For your convenience, the Launch Template ID is available as the environment variable $NEW_NODEGROUP_1_LT_ID.
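For example, inspecting the default version of the template:

$ aws ec2 describe-launch-template-versions \
    --launch-template-id $NEW_NODEGROUP_1_LT_ID \
    --versions '$Default'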
{
  "LaunchTemplateVersions": [
    {
      "LaunchTemplateId": "lt-1234abcd1234abcd1",
      ...
      "DefaultVersion": true,
      "LaunchTemplateData": {
        ...
        "BlockDeviceMappings": [
          {
            "DeviceName": "/dev/xvda",
            "Ebs": {
--->>>        "Encrypted": true,
--->>>        "KmsKeyId": "arn:aws:kms:us-west-2:xxxxxxxxxxxx:key/xxxxxxxxxxxx",
              "VolumeSize": 20,
              "VolumeType": "gp2"
            }
          }
        ]
        ...
      }
    }
  ]
}
Step 6: Verify KMS Key Configuration
6.1. Let's examine the KMS key status and permissions
Note: For your convenience, the KMS Key ID is available as the environment variable $NEW_KMS_KEY_ID.
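For example (the --query filter trims the response's KeyMetadata down to the fields shown below; it is optional):

$ aws kms describe-key \
    --key-id $NEW_KMS_KEY_ID \
    --query 'KeyMetadata.{KeyId: KeyId, Enabled: Enabled, KeyUsage: KeyUsage, KeyState: KeyState, KeyManager: KeyManager}'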
{
  "KeyId": "1234abcd-1234-abcd-1234-1234abcd1234",
  "Enabled": true, <<<---
  "KeyUsage": "ENCRYPT_DECRYPT",
  "KeyState": "Enabled", <<<---
  "KeyManager": "CUSTOMER"
}
You can also view this information in the KMS Console. The key will have an alias named new_kms_key_alias followed by a 5-character random string (e.g. new_kms_key_alias_123ab).

6.2. Check the key policy for the CMK
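Every KMS key has a key policy named default, which you can fetch as plain text:

$ aws kms get-key-policy \
    --key-id $NEW_KMS_KEY_ID \
    --policy-name default \
    --output text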
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    }
  ]
}
The key policy only grants access to the account root. There is no statement for the Auto Scaling service-linked role (AWSServiceRoleForAutoScaling), so that role cannot use the key when launching instances with encrypted volumes, and every launch fails.
Step 7: Implement Solution
7.1. Add the required KMS key policy
The updated policy will look similar to the following:
{
  "Version": "2012-10-17",
  "Id": "default",
  "Statement": [
    {
      "Sid": "EnableIAMUserPermissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowAutoScalingServiceRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowAttachmentOfPersistentResources",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
      },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": {
        "Bool": {
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}
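One way to apply it (a sketch, assuming you saved the JSON above to a local file named key_policy.json) is:

$ aws kms put-key-policy \
    --key-id $NEW_KMS_KEY_ID \
    --policy-name default \
    --policy file://key_policy.json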
7.2. Scale down and scale up the node group
This can take up to 1 minute.
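A sketch of the cycle, reusing the min/max bounds from Step 3 (scale desired capacity to 0, wait for the update to finish, then scale back to 1):

$ aws eks update-nodegroup-config \
    --cluster-name eks-workshop \
    --nodegroup-name new_nodegroup_1 \
    --scaling-config minSize=0,maxSize=1,desiredSize=0
$ aws eks wait nodegroup-active \
    --cluster-name eks-workshop \
    --nodegroup-name new_nodegroup_1
$ aws eks update-nodegroup-config \
    --cluster-name eks-workshop \
    --nodegroup-name new_nodegroup_1 \
    --scaling-config minSize=0,maxSize=1,desiredSize=1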
Step 8: Verification
Let's verify our fix has resolved the issue:
8.1. Check node group status
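For example, with --output text to print just the status string:

$ aws eks describe-nodegroup \
    --cluster-name eks-workshop \
    --nodegroup-name new_nodegroup_1 \
    --query 'nodegroup.status' \
    --output text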
ACTIVE
8.2. Verify node joining
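Using the same label selector as in Step 1:

$ kubectl get nodes -l eks.amazonaws.com/nodegroup=new_nodegroup_1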
NAME                                          STATUS   ROLES    AGE    VERSION
ip-10-42-108-252.us-west-2.compute.internal   Ready    <none>   3m9s   v1.30.0-eks-036c24b
A newly joined node can take up to a minute or so to appear.
Key Takeaways
Security Implementation
- Properly configure KMS key policies when implementing encryption
- Ensure service roles have necessary permissions
- Validate security configurations before deployment
Troubleshooting Process
- Follow the resource chain (Node → Node Group → ASG → Launch Template)
- Check health status and error messages at each level
- Verify service role permissions
Best Practices
- Test security implementations in non-production environments
- Document required permissions for service roles
- Implement proper error handling and monitoring