Missing Worker Nodes

Background

Corporation XYZ is launching a new e-commerce platform in the us-west-2 region using an EKS cluster running Kubernetes version 1.30. During a security review, several gaps were identified in the cluster's security posture, particularly around node group volume encryption and AMI customization.

The security team provided specific requirements including:

Enabling encryption for node group volumes
Setting up best practice network configurations
Ensuring EKS Optimized AMIs are used
Enabling Kubernetes auditing

Sam, an engineer with Kubernetes experience but new to EKS, created a new managed node group named new_nodegroup_1 to implement these requirements. However, no new nodes are joining the cluster despite the node group creation appearing successful. Initial checks of the EKS cluster status, node group configuration, and Kubernetes events haven't revealed any obvious issues.

Step 1: Verify Node Status

Let's first verify Sam's observation about the missing nodes:

~$kubectl get nodes --selector=eks.amazonaws.com/nodegroup=new_nodegroup_1

No resources found

note

This confirms Sam's observation: the new node group (new_nodegroup_1) has no nodes in the cluster.
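When you repeat this check during troubleshooting, it can help to reduce the output to a single number. The following is a minimal sketch that wraps the same label selector in a shell function. The function name count_nodegroup_nodes is hypothetical and not part of the workshop, and the sketch assumes kubectl is already configured for the cluster.

```shell
# Hypothetical helper: print how many nodes carry the given node group label.
# Usage: count_nodegroup_nodes new_nodegroup_1
count_nodegroup_nodes() {
  # "-o name" prints one "node/<name>" line per matching node and nothing
  # when no node matches; grep -c counts those lines.
  kubectl get nodes \
    --selector="eks.amazonaws.com/nodegroup=$1" \
    -o name 2>/dev/null | grep -c '^node/' || true
}
```

A result of 0 corresponds to the "No resources found" message shown above.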

Step 2: Check Managed Node Group Status

Since Managed Node Groups are responsible for creating nodes, let's examine the nodegroup details. Key aspects to check:

Node group existence
Status and health
Desired size

~$aws eks describe-nodegroup --cluster-name $EKS_CLUSTER_NAME --nodegroup-name new_nodegroup_1

info

You can also view this information in the EKS Console:

Open EKS Cluster Compute Tab

Step 3: Analyze Node Group Health Status

The nodegroup should eventually transition to a DEGRADED state. Let's examine the detailed status:

info

If the Workernodes workshop environment was deployed less than 10 minutes ago, the node group may still show an ACTIVE status. In that case, treat the output below as a reference: the node group should transition to DEGRADED within 10 minutes of deployment. In the meantime, you can proceed to Step 4 and check the Auto Scaling group directly.

~$aws eks describe-nodegroup --cluster-name $EKS_CLUSTER_NAME --nodegroup-name new_nodegroup_1 --query 'nodegroup.{NodegroupName:nodegroupName,Status:status,ScalingConfig:scalingConfig,AutoScalingGroups:resources.autoScalingGroups,Health:health}'

{
    "nodegroup": {
        "nodegroupName": "new_nodegroup_1", <<<---
        "nodegroupArn": "arn:aws:eks:us-west-2:1234567890:nodegroup/eks-workshop/new_nodegroup_1/abcd1234-1234-abcd-1234-1234abcd1234",
        "clusterName": "eks-workshop",
        ...
        "status": "DEGRADED",               <<<---
        "capacityType": "ON_DEMAND",
        "scalingConfig": {
            "minSize": 0,
            "maxSize": 1,
            "desiredSize": 1                <<<---
        },
        ...
        "resources": {
            "autoScalingGroups": [
                {
                    "name": "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234"
                }
            ]
        },
        "health": {                         <<<---
            "issues": [
                {
                    "code": "AsgInstanceLaunchFailures",
                    "message": "Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state",
                    "resourceIds": [
                        "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234"
                    ]
                }
            ]
        }
        ...
    }
}

note

The health status reveals a KMS key issue preventing instance launches. This aligns with Sam's attempt to implement volume encryption.
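To pull out only the health issues without scanning the full description, the --query option can select the issue code and message. The following is a sketch; the function name nodegroup_health_issues is hypothetical and not part of the workshop.

```shell
# Hypothetical helper: list each node group health issue as "<code><TAB><message>".
# Usage: nodegroup_health_issues "$EKS_CLUSTER_NAME" new_nodegroup_1
nodegroup_health_issues() {
  aws eks describe-nodegroup \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --query 'nodegroup.health.issues[].[code,message]' \
    --output text
}
```

For the node group in this scenario, the AsgInstanceLaunchFailures code and its KMS-related message should appear on one line.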

Step 4: Investigate Auto Scaling Group Activities

Let's examine the ASG activities to understand the launch failures:

4.1. Identify Nodegroup's Auto Scaling Group Name

Run the following command to store the node group's Auto Scaling group name in the variable NEW_NODEGROUP_1_ASG_NAME.

~$NEW_NODEGROUP_1_ASG_NAME=$(aws eks describe-nodegroup --cluster-name $EKS_CLUSTER_NAME --nodegroup-name new_nodegroup_1 --query 'nodegroup.resources.autoScalingGroups[0].name' --output text)

echo $NEW_NODEGROUP_1_ASG_NAME

4.2. Check the AutoScaling Activities

~$aws autoscaling describe-scaling-activities --auto-scaling-group-name ${NEW_NODEGROUP_1_ASG_NAME}

{
    "Activities": [
        {
            "ActivityId": "1234abcd-1234-abcd-1234-1234abcd1234",
            "AutoScalingGroupName": "eks-new_nodegroup_1-abcd1234-1234-abcd-1234-1234abcd1234",
            "Description": "Launching a new EC2 instance: i-1234abcd1234abcd1.  Status Reason: Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state",
            "Cause": "At 2024-10-04T18:06:36Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
            ...
            "StatusCode": "Cancelled",
  --->>>    "StatusMessage": "Instance became unhealthy while waiting for instance to be in InService state. Termination Reason: Client.InvalidKMSKey.InvalidState: The KMS key provided is in an incorrect state"
        },
        ...
    ]
}

info

You can also view this information in the EKS Console. Click the Auto Scaling group name under the Details tab to view the Auto Scaling activities.

Open EKS cluster Nodegroup Tab
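When an Auto Scaling group has a long activity history, filtering on the status code can make the failures easier to spot. The following is a sketch; the function name failed_scaling_activities is hypothetical and not part of the workshop.

```shell
# Hypothetical helper: print start time, status code, and status message for
# every scaling activity whose status is not "Successful". Activities that are
# still in progress are listed as well.
# Usage: failed_scaling_activities "$NEW_NODEGROUP_1_ASG_NAME"
failed_scaling_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" \
    --query "Activities[?StatusCode!='Successful'].[StartTime,StatusCode,StatusMessage]" \
    --output text
}
```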

Step 5: Examine Launch Template Configuration

Let's check the Launch Template for encryption settings:

5.1. Find the Launch Template ID from the ASG or managed nodegroup. In this example we will use ASG

~$aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names ${NEW_NODEGROUP_1_ASG_NAME} \
    --query 'AutoScalingGroups[0].MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification.LaunchTemplateId' \
    --output text
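The managed node group route mentioned in the heading can be sketched as follows. This assumes the node group was created with a custom launch template, in which case describe-nodegroup reports it under nodegroup.launchTemplate. The function name nodegroup_launch_template_id is hypothetical, and the ID it returns may differ from the one the ASG reports, because the ASG can reference a launch template that EKS generated from the one you supplied.

```shell
# Hypothetical helper: print the ID of the launch template that was supplied
# when the managed node group was created. With text output, a node group
# created without a custom launch template typically prints "None".
# Usage: nodegroup_launch_template_id "$EKS_CLUSTER_NAME" new_nodegroup_1
nodegroup_launch_template_id() {
  aws eks describe-nodegroup \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --query 'nodegroup.launchTemplate.id' \
    --output text
}
```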

5.2. Now we can check the encryption settings

info

For your convenience, the Launch Template ID is available in the environment variable $NEW_NODEGROUP_1_LT_ID.

~$aws ec2 describe-launch-template-versions --launch-template-id ${NEW_NODEGROUP_1_LT_ID} --query 'LaunchTemplateVersions[].{LaunchTemplateId:LaunchTemplateId,DefaultVersion:DefaultVersion,BlockDeviceMappings:LaunchTemplateData.BlockDeviceMappings}'

{
    "LaunchTemplateVersions": [
        {
            "LaunchTemplateId": "lt-1234abcd1234abcd1",
            ...
            "DefaultVersion": true,
            "LaunchTemplateData": {
                ...
                "BlockDeviceMappings": [
                    {
                        "DeviceName": "/dev/xvda",
                        "Ebs": {
    --->>>                 "Encrypted": true,
    --->>>                 "KmsKeyId": "arn:aws:kms:us-west-2:xxxxxxxxxxxx:key/xxxxxxxxxxxx",
                            "VolumeSize": 20,
                            "VolumeType": "gp2"
                        }
                    }
                ]
            }
        }
    ]
}
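If the launch template has several versions, the output above can get long. The following sketch restricts the query to the default version and prints only the encryption-related fields. The function name launch_template_ebs_encryption is hypothetical and not part of the workshop.

```shell
# Hypothetical helper: for the default launch template version, print one line
# per block device with its device name, Encrypted flag, and KMS key ID.
# Usage: launch_template_ebs_encryption "$NEW_NODEGROUP_1_LT_ID"
launch_template_ebs_encryption() {
  aws ec2 describe-launch-template-versions \
    --launch-template-id "$1" \
    --versions '$Default' \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.BlockDeviceMappings[].[DeviceName,Ebs.Encrypted,Ebs.KmsKeyId]' \
    --output text
}
```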

Step 6: Verify KMS Key Configuration

6.1. Let's examine the KMS key status and permissions

info

For your convenience, the KMS Key ID is available in the environment variable $NEW_KMS_KEY_ID.

~$aws kms describe-key --key-id ${NEW_KMS_KEY_ID} --query 'KeyMetadata.{KeyId:KeyId,Enabled:Enabled,KeyUsage:KeyUsage,KeyState:KeyState,KeyManager:KeyManager}'

{
    "KeyId": "1234abcd-1234-abcd-1234-1234abcd1234",
    "Enabled": true,                                 <<<---
    "KeyUsage": "ENCRYPT_DECRYPT",
    "KeyState": "Enabled",                           <<<---
    "KeyManager": "CUSTOMER"
}

info

You can also view this information in the KMS Console. The key has an alias that starts with new_kms_key_alias followed by five random characters (e.g. new_kms_key_alias_123ab):

Open KMS Customer managed keys

6.2. Check the key policy for the CMK

~$aws kms get-key-policy --key-id ${NEW_KMS_KEY_ID} | jq -r '.Policy | fromjson'

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    }
  ]
}

info

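The key itself is enabled, but its policy contains only the default statement for the account root principal. It has no statement granting the Auto Scaling service-linked role (AWSServiceRoleForAutoScaling) access to the key, which that role needs in order to launch instances with volumes encrypted by this key.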
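A quick way to repeat this check, before and after the fix, is to search the policy document for the service-linked role name. The following is a rough sketch: the function name key_policy_mentions_autoscaling_role is hypothetical, and a string match only shows that the role is referenced, not that the granted actions are sufficient.

```shell
# Hypothetical helper: succeed (exit status 0) if the key's default policy
# mentions the Auto Scaling service-linked role, fail otherwise.
# Usage: key_policy_mentions_autoscaling_role "$NEW_KMS_KEY_ID" && echo "role found"
key_policy_mentions_autoscaling_role() {
  aws kms get-key-policy \
    --key-id "$1" \
    --policy-name default \
    --query Policy \
    --output text | grep -q 'AWSServiceRoleForAutoScaling'
}
```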

Step 7: Implement Solution

7.1. Add the required KMS key policy

~$NEW_POLICY=$(echo '{"Version":"2012-10-17","Id":"default","Statement":[{"Sid":"EnableIAMUserPermissions","Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'"$AWS_ACCOUNT_ID"':root"},"Action":"kms:*","Resource":"*"},{"Sid":"AllowAutoScalingServiceRole","Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'"$AWS_ACCOUNT_ID"':role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"},"Action":["kms:Encrypt","kms:Decrypt","kms:ReEncrypt*","kms:GenerateDataKey*","kms:DescribeKey"],"Resource":"*"},{"Sid":"AllowAttachmentOfPersistentResources","Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'"$AWS_ACCOUNT_ID"':role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"},"Action":"kms:CreateGrant","Resource":"*","Condition":{"Bool":{"kms:GrantIsForAWSResource":"true"}}}]}') && aws kms put-key-policy --key-id "$NEW_KMS_KEY_ID" --policy-name default --policy "$NEW_POLICY" && aws kms get-key-policy --key-id "$NEW_KMS_KEY_ID" --policy-name default | jq -r '.Policy | fromjson'

note

The resulting policy should look similar to the following.

{
  "Version": "2012-10-17",
  "Id": "default",
  "Statement": [
    {
      "Sid": "EnableIAMUserPermissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:root"
      },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowAutoScalingServiceRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowAttachmentOfPersistentResources",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::1234567890:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling"
      },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": {
        "Bool": {
          "kms:GrantIsForAWSResource": "true"
        }
      }
    }
  ]
}
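As an alternative to the single-line command in 7.1, the same policy can be generated from a here-document and passed to the CLI as a file, which is easier to read and review. The following is a sketch: the function name build_key_policy and the file name key-policy.json are hypothetical, while the policy content mirrors the one shown above.

```shell
# Hypothetical helper: print the key policy for the AWS account ID given as $1.
build_key_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Id": "default",
  "Statement": [
    {
      "Sid": "EnableIAMUserPermissions",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${1}:root" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowAutoScalingServiceRole",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${1}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling" },
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:GenerateDataKey*", "kms:DescribeKey"],
      "Resource": "*"
    },
    {
      "Sid": "AllowAttachmentOfPersistentResources",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${1}:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling" },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": { "Bool": { "kms:GrantIsForAWSResource": "true" } }
    }
  ]
}
EOF
}

# Example usage (not executed here):
#   build_key_policy "$AWS_ACCOUNT_ID" > key-policy.json
#   aws kms put-key-policy --key-id "$NEW_KMS_KEY_ID" \
#     --policy-name default --policy file://key-policy.json
```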

7.2. Scale down and scale up the node group

~$aws eks update-nodegroup-config --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1 --scaling-config desiredSize=0 && aws eks wait nodegroup-active --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1 && aws eks update-nodegroup-config --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1 --scaling-config desiredSize=1 && aws eks wait nodegroup-active --cluster-name "${EKS_CLUSTER_NAME}" --nodegroup-name new_nodegroup_1

info

This can take up to 1 minute.

Step 8: Verification

Let's verify our fix has resolved the issue:

8.1. Check node group status

~$aws eks describe-nodegroup --cluster-name ${EKS_CLUSTER_NAME} --nodegroup-name new_nodegroup_1 --query 'nodegroup.status' --output text

ACTIVE

8.2. Verify node joining

~$kubectl get nodes --selector=eks.amazonaws.com/nodegroup=new_nodegroup_1

NAME                                          STATUS   ROLES    AGE    VERSION
ip-10-42-108-252.us-west-2.compute.internal   Ready    <none>   3m9s   v1.30.0-eks-036c24b

info

A newly joined node can take about a minute to appear.
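Rather than re-running the command by hand, the wait can be scripted. The following is a sketch under a few assumptions: the function name wait_for_nodegroup_ready, the default of 30 attempts, the 10-second polling interval (overridable through the hypothetical POLL_SECONDS variable), and the 120-second readiness timeout are all illustrative choices, not workshop settings.

```shell
# Hypothetical helper: wait until at least one node from the node group has
# registered, then wait for the matching nodes to report Ready.
# Usage: wait_for_nodegroup_ready new_nodegroup_1 [max_attempts]
wait_for_nodegroup_ready() {
  selector="eks.amazonaws.com/nodegroup=$1"
  attempts="${2:-30}"
  i=0
  # kubectl wait reports an error when no resource matches the selector,
  # so poll first until a node with the label exists.
  while [ -z "$(kubectl get nodes --selector="$selector" -o name 2>/dev/null)" ]; do
    i=$((i + 1))
    if [ "$i" -ge "$attempts" ]; then
      echo "no nodes registered for node group $1" >&2
      return 1
    fi
    sleep "${POLL_SECONDS:-10}"
  done
  kubectl wait --for=condition=Ready nodes --selector="$selector" --timeout=120s
}
```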

Key Takeaways

Security Implementation

Properly configure KMS key policies when implementing encryption
Ensure service roles have necessary permissions
Validate security configurations before deployment

Troubleshooting Process

Follow the resource chain (Node → Node Group → ASG → Launch Template)
Check health status and error messages at each level
Verify service role permissions

Best Practices

Test security implementations in non-production environments
Document required permissions for service roles
Implement proper error handling and monitoring
