Checking VPC configuration
DNS traffic between application pods, kube-dns service, and CoreDNS pods often traverses multiple nodes and VPC subnets. We need to verify that DNS traffic can flow freely at the VPC level.
Two main VPC components can filter network traffic:
- Security Groups
- Network ACLs
We should verify that both worker node Security Groups and subnet Network ACLs allow DNS traffic (port 53 UDP/TCP) in both directions.
Step 1 - Identify worker node Security Groups
Let's start by identifying the Security Groups associated with cluster worker nodes.
During cluster creation, EKS creates a cluster Security Group that's associated with both the cluster endpoint and all Managed Nodes. If no additional Security Groups are configured, this is the only Security Group controlling worker node traffic.
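The cluster Security Group ID can be retrieved with the AWS CLI. A minimal sketch, assuming the cluster name is available in an `$EKS_CLUSTER_NAME` environment variable (adjust to your environment):

```shell
# Query the cluster Security Group that EKS created at cluster creation time.
# $EKS_CLUSTER_NAME is assumed to hold the cluster name.
aws eks describe-cluster \
  --name "$EKS_CLUSTER_NAME" \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' \
  --output text
```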
sg-xxxxbbda9848bxxxx
Now check for any additional Security Groups on worker nodes:
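One way to do this is to list the instances belonging to the cluster's managed node groups along with their attached Security Groups. This sketch assumes the instances carry the `eks:cluster-name` tag that EKS applies to managed nodes, and that `$EKS_CLUSTER_NAME` holds the cluster name:

```shell
# List worker node instance IDs and their attached Security Groups.
# $EKS_CLUSTER_NAME is assumed to hold the cluster name.
aws ec2 describe-instances \
  --filters "Name=tag:eks:cluster-name,Values=$EKS_CLUSTER_NAME" \
  --query 'Reservations[].Instances[].[InstanceId, SecurityGroups[].GroupId]' \
  --output table
```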
--------------------------
|    DescribeInstances   |
+------------------------+
|  i-xxxx2e04aa2baxxxx   |
|  sg-xxxxbbda9848bxxxx  |
|  i-xxxx45e34d609xxxx   |
|  sg-xxxxbbda9848bxxxx  |
|  i-xxxxdc536ec33xxxx   |
|  sg-xxxxbbda9848bxxxx  |
+------------------------+
We can see that worker nodes only use the cluster Security Group sg-xxxxbbda9848bxxxx.
Step 2 - Check worker node Security Group rules
Let's examine worker node Security Group rules:
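A sketch of the query, assuming the cluster Security Group ID from Step 1 is stored in `$CLUSTER_SG`:

```shell
# Show all rules in the cluster Security Group.
# $CLUSTER_SG is assumed to hold the Security Group ID identified in Step 1.
aws ec2 describe-security-group-rules \
  --filters "Name=group-id,Values=$CLUSTER_SG" \
  --query 'SecurityGroupRules[].{CidrIpv4:CidrIpv4,FromPort:FromPort,IsEgressRule:IsEgress,Protocol:IpProtocol,SourceSG:ReferencedGroupInfo.GroupId,ToPort:ToPort}' \
  --output table
```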
-----------------------------------------------------------------------------------------
|                              DescribeSecurityGroupRules                               |
+-----------+-----------+---------------+-----------+------------------------+----------+
| CidrIpv4  | FromPort  | IsEgressRule  | Protocol  | SourceSG               | ToPort   |
+-----------+-----------+---------------+-----------+------------------------+----------+
| 0.0.0.0/0 | -1        | True          | -1        | None                   | -1       |
| None      | 10250     | False         | tcp       | sg-0fcabbda9848b346e   | 10250    |
| None      | -1        | False         | -1        | sg-09eca28cacae05248   | -1       |
| None      | 443       | False         | tcp       | sg-0fcabbda9848b346e   | 443      |
+-----------+-----------+---------------+-----------+------------------------+----------+
There are three ingress rules and one egress rule:
- Egress: all protocols and ports to all IP addresses (0.0.0.0/0) - note the value True in the IsEgressRule column
- Ingress: TCP port 10250 from within the same Security Group (sg-0fcabbda9848b346e)
- Ingress: TCP port 443 from within the same Security Group (sg-0fcabbda9848b346e)
- Ingress: all protocols and ports from another Security Group (sg-09eca28cacae05248), which is not associated with the worker nodes
Notably absent are any rules allowing DNS traffic (UDP/TCP port 53), which explains our DNS resolution failures: queries from application pods to CoreDNS pods on other nodes are silently dropped.
Root Cause
When tightening cluster security, users might overly restrict the cluster Security Group rules. For proper cluster operation, DNS traffic must be allowed either through the cluster Security Group or through a separate Security Group attached to worker nodes.
In this case, the cluster Security Group only allows ports 443 and 10250, blocking DNS traffic and causing name resolution timeouts.
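Before applying a fix, the symptom can be confirmed by attempting a lookup from a temporary pod. The image and target name below are illustrative; any image that provides nslookup works:

```shell
# Reproduce the failure: run a DNS lookup from a temporary pod.
# busybox:1.36 is an illustrative image choice, not prescribed by the lab.
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local
# With port 53 blocked, this times out instead of returning the service IP.
```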
Resolution
Following EKS security group requirements, we'll allow all traffic within the cluster Security Group:
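A sketch of the rule change, assuming the cluster Security Group ID is stored in `$CLUSTER_SG` (the `--ip-permissions` shorthand adds a self-referencing all-protocol ingress rule):

```shell
# Allow all traffic between resources attached to the cluster Security Group.
# $CLUSTER_SG is assumed to hold the cluster Security Group ID from Step 1.
aws ec2 authorize-security-group-ingress \
  --group-id "$CLUSTER_SG" \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$CLUSTER_SG}]"
```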
Recreate the application pods:
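One way to do this, assuming each application deployment shares its namespace's name (as the pod listing below suggests); adjust the list to your environment:

```shell
# Restart the affected application deployments so new pods start with working DNS.
# Deployment names are assumed to match their namespace names.
for ns in assets carts catalog checkout orders ui; do
  kubectl -n "$ns" rollout restart "deployment/$ns"
done
```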
Verify all pods reach Ready state:
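A sketch of the verification, waiting for each rollout to complete before listing pods (same assumed namespace/deployment names as above):

```shell
# Wait for each application rollout to finish, then list pods cluster-wide.
for ns in assets carts catalog checkout orders ui; do
  kubectl -n "$ns" rollout status "deployment/$ns" --timeout=120s
done
kubectl get pods -A
```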
NAMESPACE NAME READY STATUS RESTARTS AGE
assets assets-784b5f5656-fjh7t 1/1 Running 0 50s
carts carts-5475469b7c-bwjsf 1/1 Running 0 50s
carts carts-dynamodb-69fc586887-pmkw7 1/1 Running 0 19h
catalog catalog-5578f9649b-pkdfz 1/1 Running 0 50s
catalog catalog-mysql-0 1/1 Running 0 19h
checkout checkout-84c6769ddd-d46n2 1/1 Running 0 50s
checkout checkout-redis-76bc7cb6f9-4g5qz 1/1 Running 0 23d
orders orders-6d74499d87-mh2r2 1/1 Running 0 50s
orders orders-mysql-6fbd688d4b-m7gpt 1/1 Running 0 19h
ui ui-5f4d85f85f-xnh8q 1/1 Running 0 50s
For more information, see Amazon EKS security group requirements.
While this lab focuses on Security Groups, Network ACLs can also affect traffic flow in EKS clusters. For more information about Network ACLs, see Control subnet traffic with network access control lists.
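If you suspect Network ACLs, the entries on the cluster's subnets can be dumped for inspection. A sketch, assuming the cluster VPC ID is stored in `$VPC_ID`:

```shell
# List Network ACL entries in the cluster VPC; look for deny rules
# covering UDP/TCP port 53 in either direction.
# $VPC_ID is assumed to hold the cluster VPC ID.
aws ec2 describe-network-acls \
  --filters "Name=vpc-id,Values=$VPC_ID" \
  --query 'NetworkAcls[].[NetworkAclId, Entries]'
```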
Conclusions
Throughout the sections of this lab, we investigated several issues that affect DNS resolution in EKS clusters, identified their root causes, and applied the steps needed to fix them.
In this lab, we've:
- Identified multiple issues affecting DNS resolution in our EKS cluster
- Followed a systematic troubleshooting approach to diagnose each issue
- Applied the necessary fixes to restore DNS functionality
- Verified that all application pods are now running properly
All application pods should now be in Ready state with DNS resolution working correctly.