Missing Worker Nodes

Background

Corporation XYZ is launching a new e-commerce platform in the us-west-2 region using an EKS cluster running Kubernetes version 1.30. During a security review, several gaps were identified in the cluster's security posture, particularly around node group volume encryption and AMI customization.

The security team provided specific requirements including:

  • Enabling encryption for node group volumes
  • Setting up best practice network configurations
  • Ensuring EKS Optimized AMIs are used
  • Enabling Kubernetes auditing

Sam, an engineer with Kubernetes experience but new to EKS, created a new managed node group named new_nodegroup_1 to implement these requirements. However, no new nodes are joining the cluster despite the node group creation appearing successful. Initial checks of the EKS cluster status, node group configuration, and Kubernetes events haven't revealed any obvious issues.

Step 1: Verify Node Status

Let's first verify Sam's observation about the missing nodes:

~$kubectl get nodes --selector=eks.amazonaws.com/nodegroup=new_nodegroup_1
No resources found

This confirms Sam's observation: the cluster has no nodes from the new node group (new_nodegroup_1).

Step 2: Check Managed Node Group Status

Since Managed Node Groups are responsible for creating nodes, let's examine the nodegroup details. Key aspects to check:

  • Node group existence
  • Status and health
  • Desired size

~$aws eks describe-nodegroup --cluster-name eks-workshop --nodegroup-name new_nodegroup_1

You can also view this information in the EKS console, on the cluster's Compute tab.

Step 3: Analyze Node Group Health Status

The nodegroup should eventually transition to a DEGRADED state. Let's examine the detailed status:

Note: If you deployed the worker nodes workshop environment less than 10 minutes ago, the nodegroup may still be in the ACTIVE state. It should transition to DEGRADED within 10 minutes of deployment. In the meantime, refer to the sample output below, or proceed to Step 4 to check the Auto Scaling group directly.

~$aws eks describe-nodegroup --cluster-name eks-workshop --nodegroup-name new_nodegroup_1 --query 'nodegroup.{NodegroupName:nodegroupName,Status:status,ScalingConfig:scalingConfig,AutoScalingGroups:resources.autoScalingGroups,Health:health}'
 
 
{
    "nodegroup": {
        "nodegroupName": "new_nodegroup_1", <<<---
        "nodegroupArn": "arn:aws:eks:us-west-2:1234567890:nodegroup/eks-workshop/new_nodegroup_1/abcd1234-1234-abcd-1234-1234abcd1234",
        "clusterName": "eks-workshop",
        ...
        "status": "DEGRADED",               <<<---
        "capacityType": "ON_DEMAND",
        "scalingConfig": {
            "minSize": 0,
            "maxSize": 1,
            "desiredSize": 1                <<<---
        },
        ...
        "resources": {
            "autoScalingGroups": [
                {
                    "name": "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234"
                }
            ]
        },
        "health": {                         <<<---
            "issues": [
                {
                    "code": "AsgInstanceLaunchFailures",
                    "message": "Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state",
                    "resourceIds": [
                        "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234"
                    ]
                }
            ]
        }
        ...
    }
}

The health status reveals a KMS key issue preventing instance launches. This aligns with Sam's attempt to implement volume encryption.
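
To focus on the health issues alone, the same describe-nodegroup call can be narrowed with a JMESPath query. The sketch below wraps it in a shell function. The function name nodegroup_health_issues is hypothetical, and the cluster and node group names are passed as arguments.

```shell
# Hypothetical helper: print each node group health issue on its own line
# as "<code><TAB><message>".
# Arguments: $1 = cluster name, $2 = node group name.
nodegroup_health_issues() {
  aws eks describe-nodegroup \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --query 'nodegroup.health.issues[].[code,message]' \
    --output text
}

# Example usage (requires AWS credentials):
#   nodegroup_health_issues eks-workshop new_nodegroup_1
```

An empty result means the node group currently reports no health issues.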

Step 4: Investigate Auto Scaling Group Activities

Let's examine the ASG activities to understand the launch failures:

Note: For your convenience, the Auto Scaling group name is available in the environment variable $NEW_NODEGROUP_1_ASG_NAME.

~$aws autoscaling describe-scaling-activities --auto-scaling-group-name ${NEW_NODEGROUP_1_ASG_NAME}
 
{
    "Activities": [
        {
            "ActivityId": "1234abcd-1234-abcd-1234-1234abcd1234",
            "AutoScalingGroupName": "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234",
            "Description": "Launching a new EC2 instance: i-1234abcd1234abcd1.  Status Reason: Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state",
            "Cause": "At 2024-10-04T18:06:36Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
            ...
            "StatusCode": "Cancelled",
  --->>>    "StatusMessage": "Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state"
        },
        ...
    ]
}

You can also view this information in the EKS console. On the node group's page, click the Auto Scaling group name under the Details tab to view its scaling activities.
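
When the activity list is long, it can help to pull out just the termination reason. The following is a minimal sketch, assuming the status message follows the "Termination Reason: ..." pattern shown in the output above. The helper name termination_reason is hypothetical.

```shell
# Hypothetical helper: read status messages on stdin and print only the text
# that follows "Termination Reason: ". Lines without that marker are skipped.
termination_reason() {
  sed -n 's/.*Termination Reason: //p'
}

# Example usage for the first activity in the list (requires AWS credentials):
#   aws autoscaling describe-scaling-activities \
#     --auto-scaling-group-name "${NEW_NODEGROUP_1_ASG_NAME}" \
#     --query 'Activities[0].StatusMessage' --output text | termination_reason
```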

Step 5: Examine Launch Template Configuration

Let's check the Launch Template for encryption settings:

5.1. Find the Launch Template ID from the ASG or the managed nodegroup. This example uses the ASG.

~$aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names ${NEW_NODEGROUP_1_ASG_NAME} \
--query 'AutoScalingGroups[0].MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification.LaunchTemplateId' \
--output text

5.2. Now we can check the encryption settings

Note: For your convenience, the Launch Template ID is available in the environment variable $NEW_NODEGROUP_1_LT_ID.

~$aws ec2 describe-launch-template-versions --launch-template-id ${NEW_NODEGROUP_1_LT_ID} --query 'LaunchTemplateVersions[].{LaunchTemplateId:LaunchTemplateId,DefaultVersion:DefaultVersion,BlockDeviceMappings:LaunchTemplateData.BlockDeviceMappings}'
 
{
    "LaunchTemplateVersions": [
        {
            "LaunchTemplateId": "lt-1234abcd1234abcd1",
            ...
            "DefaultVersion": true,
            "LaunchTemplateData": {
            ...
                "BlockDeviceMappings": [
                    {
                        "DeviceName": "/dev/xvda",
                        "Ebs": {
    --->>>                 "Encrypted": true,
    --->>>                 "KmsKeyId": "arn:aws:kms:us-west-2:xxxxxxxxxxxx:key/xxxxxxxxxxxx",
                            "VolumeSize": 20,
                            "VolumeType": "gp2"
                        }
                    }
                ]
            }
        }
    ]
}

Step 6: Verify KMS Key Configuration

6.1. Examine the KMS key state

Note: For your convenience, the KMS key ID is available in the environment variable $NEW_KMS_KEY_ID.

~$aws kms describe-key --key-id ${NEW_KMS_KEY_ID} --query 'KeyMetadata.{KeyId:KeyId,Enabled:Enabled,KeyUsage:KeyUsage,KeyState:KeyState,KeyManager:KeyManager}'
 
{
    "KeyId": "1234abcd-1234-abcd-1234-1234abcd1234",
    "Enabled": true,                                 <<<---
    "KeyUsage": "ENCRYPT_DECRYPT",
    "KeyState": "Enabled",                           <<<---
    "KeyManager": "CUSTOMER"
}

You can also view this information in the KMS console, under Customer managed keys. The key has an alias that starts with new_kms_key_alias followed by five random characters (for example, new_kms_key_alias_123ab).

6.2. Check the key policy for the CMK

~$aws kms get-key-policy --key-id ${NEW_KMS_KEY_ID} | jq -r '.Policy | fromjson'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    }
  ]
}

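The key is enabled, so the error does not come from the key's state. The key policy, however, grants access only to the account root and is missing the permissions that the Auto Scaling service-linked role (AWSServiceRoleForAutoScaling) needs in order to use the key for encrypted volumes.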
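
A quick way to spot this gap is to search the policy text for the service-linked role name. The sketch below is an assumption-laden shortcut, not a full policy evaluation: the helper name policy_names_asg_role is hypothetical, and a textual match only shows that the role is mentioned, not that it has the right actions.

```shell
# Hypothetical helper: read a key policy JSON document on stdin and report
# whether it mentions the Auto Scaling service-linked role anywhere.
policy_names_asg_role() {
  if grep -q 'AWSServiceRoleForAutoScaling'; then
    echo "present"
  else
    echo "missing"
  fi
}

# Example usage (requires AWS credentials):
#   aws kms get-key-policy --key-id "${NEW_KMS_KEY_ID}" --policy-name default \
#     --query Policy --output text | policy_names_asg_role
```

For the policy shown in step 6.2, which names only the account root, this prints "missing".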

Step 7: Implement Solution

7.1. Add the required KMS key policy

~$NEW_POLICY=$(echo '{"Version":"2012-10-17","Id":"default","Statement":[{"Sid":"EnableIAMUserPermissions","Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'"$AWS_ACCOUNT_ID"':root"},"Action":"kms:*","Resource":"*"},{"Sid":"AllowAutoScalingServiceRole","Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'"$AWS_ACCOUNT_ID"':role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"},"Action":["kms:Encrypt","kms:Decrypt","kms:ReEncrypt*","kms:GenerateDataKey*","kms:DescribeKey"],"Resource":"*"},{"Sid":"AllowAttachmentOfPersistentResources","Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'"$AWS_ACCOUNT_ID"':role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"},"Action":"kms:CreateGrant","Resource":"*","Condition":{"Bool":{"kms:GrantIsForAWSResource":"true"}}}]}') && aws kms put-key-policy --key-id "$NEW_KMS_KEY_ID" --policy-name default --policy "$NEW_POLICY" && aws kms get-key-policy --key-id "$NEW_KMS_KEY_ID" --policy-name default | jq -r '.Policy | fromjson'

The resulting policy should look similar to the following:

{
  "Version": "2012-10-17",
  "Id": "default",
  "Statement": [
    {
      "Sid": "EnableIAMUserPermissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowAutoScalingServiceRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowAttachmentOfPersistentResources",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
      },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": {
        "Bool": {
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}
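
The one-line command in step 7.1 is hard to review by eye. As an alternative sketch, the same policy can be generated from a here-document, which keeps the JSON readable. The function name build_kms_key_policy is hypothetical, and the account ID is passed as an argument rather than read from $AWS_ACCOUNT_ID.

```shell
# Hypothetical helper: print the key policy for the account ID given as $1.
# The here-document is unquoted so that ${1} is expanded inside the JSON.
build_kms_key_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Id": "default",
  "Statement": [
    {
      "Sid": "EnableIAMUserPermissions",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${1}:root" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowAutoScalingServiceRole",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${1}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling" },
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:GenerateDataKey*", "kms:DescribeKey"],
      "Resource": "*"
    },
    {
      "Sid": "AllowAttachmentOfPersistentResources",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${1}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling" },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": { "Bool": { "kms:GrantIsForAWSResource": "true" } }
    }
  ]
}
EOF
}

# Example usage (requires AWS credentials):
#   aws kms put-key-policy --key-id "${NEW_KMS_KEY_ID}" --policy-name default \
#     --policy "$(build_kms_key_policy "${AWS_ACCOUNT_ID}")"
```

Printing the policy first and reviewing it before calling put-key-policy reduces the chance of replacing the key policy with a malformed document.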

7.2. Scale down and scale up the node group

~$aws eks update-nodegroup-config --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1 --scaling-config desiredSize=0 && \
aws eks wait nodegroup-active --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1 && \
aws eks update-nodegroup-config --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1 --scaling-config desiredSize=1 && \
aws eks wait nodegroup-active --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1

This can take up to 1 minute.

Step 8: Verification

Let's verify our fix has resolved the issue:

8.1. Check node group status

~$aws eks describe-nodegroup --cluster-name ${EKS_CLUSTER_NAME} --nodegroup-name new_nodegroup_1 --query 'nodegroup.status' --output text
ACTIVE

8.2. Verify node joining

~$kubectl get nodes --selector=eks.amazonaws.com/nodegroup=new_nodegroup_1
NAME                                          STATUS   ROLES    AGE    VERSION
ip-10-42-108-252.us-west-2.compute.internal   Ready    <none>   3m9s   v1.30.0-eks-036c24b

A newly joined node can take about a minute to appear.
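
Rather than re-running the command by hand, the check can be polled. The following is a minimal sketch: the function name wait_for_nodegroup_node and its default of 30 attempts at 10-second intervals are assumptions chosen for illustration, not values from the workshop.

```shell
# Hypothetical helper: poll until at least one node in the node group is Ready.
# Arguments: $1 = node group name, $2 = max attempts (default 30),
#            $3 = delay between attempts in seconds (default 10).
wait_for_nodegroup_node() {
  nodegroup="$1"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # STATUS is the second column of `kubectl get nodes --no-headers`.
    # awk exits 0 only if some row has STATUS exactly "Ready".
    if kubectl get nodes --selector="eks.amazonaws.com/nodegroup=${nodegroup}" --no-headers 2>/dev/null \
        | awk '$2 == "Ready" { found = 1 } END { exit !found }'; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out"
  return 1
}

# Example usage:
#   wait_for_nodegroup_node new_nodegroup_1
```

The loop checks for rows rather than relying on the exit code of kubectl, because an empty selector match is not reported as a failure.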

Key Takeaways

Security Implementation

  • Properly configure KMS key policies when implementing encryption
  • Ensure service roles have necessary permissions
  • Validate security configurations before deployment

Troubleshooting Process

  • Follow the resource chain (Node → Node Group → ASG → Launch Template)
  • Check health status and error messages at each level
  • Verify service role permissions

Best Practices

  • Test security implementations in non-production environments
  • Document required permissions for service roles
  • Implement proper error handling and monitoring

Additional Resources