# Amazon* EKS Node Issues
This guide describes how to debug an Amazon Elastic Kubernetes Service* (Amazon* EKS) node that is not running properly.
## Symptom
When an Amazon EKS node is unhealthy, you will typically notice:

- `kubectl get nodes` shows the node as `NotReady`.
- Some pods are stuck in bad states, and those pods happen to be scheduled on the same node. `kubectl get pods -o wide` can be used to tell which node a pod is scheduled on.
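The symptom checks above can be scripted for a quick first pass. This is a minimal sketch, assuming `kubectl` is configured for the cluster; the `not_ready_nodes` helper is hypothetical, and the node name is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given `kubectl get nodes --no-headers` output on stdin,
# print the names of nodes whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers | not_ready_nodes
#
# List every pod scheduled on a suspect node (node name is a placeholder):
#   kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>
```

The `--field-selector spec.nodeName=...` form filters server-side, which is handy on large clusters where `kubectl get pods -o wide` output is long.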
## Cause
Determine the cause by working through the following steps.
> **Note**
> Most of the steps on this page require AWS* console access and full Kubernetes* platform access to the Amazon EKS cluster. These steps should only be performed by experienced SRE/DevOps engineers who have been granted those permissions.
1. Check for abnormal resource usage, then dig further into which process/container is not behaving correctly:
   - Log in to the Amazon EKS node from the AWS console:
     - Check disk usage: `df -h`
     - Check memory usage: `free -h`
     - Check CPU and memory usage: `top`
   - From `kubectl`:
     - `kubectl top node`
     - `kubectl top pod -A`
2. Check whether the kubelet is running and whether there are any errors in its log. Log in to the Amazon EKS node from the AWS console and run:
   - `systemctl status kubelet`
   - `sudo journalctl -xeu kubelet`
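As a concrete example of acting on the disk-usage check, the snippet below flags filesystems above a usage threshold; high disk usage can push a node into `DiskPressure` and make it `NotReady`. The `full_filesystems` helper name and the 85% default are illustrative, not part of the runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given `df -P` output on stdin, print mount points whose
# use% exceeds the threshold passed as $1 (default 85).
full_filesystems() {
  awk -v limit="${1:-85}" 'NR > 1 { pct = $5; sub(/%/, "", pct); if (pct + 0 > limit) print $6, $5 }'
}

# On the node:
#   df -P | full_filesystems 85
```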
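The kubelet check can also be done non-interactively, which is useful when scripting across many nodes. A minimal sketch, assuming the systemd-based setup used by Amazon EKS AMIs; the `kubelet_state` helper is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the ActiveState value from `systemctl show`
# output on stdin; "active" means the unit is running.
kubelet_state() {
  awk -F= '$1 == "ActiveState" { print $2 }'
}

# On the node:
#   systemctl show kubelet -p ActiveState | kubelet_state    # expect: active
#   sudo journalctl -xeu kubelet --no-pager | tail -n 50     # recent log lines
```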
## Solution
1. First, try to restart any process/container that is misbehaving or consuming excessive resources.
   - Escalate to the component owner if the container still consumes excessive resources after being restarted.
   - Proceed to step 2 if nothing is consuming a lot of resources.
2. Restart the kubelet if it is not running or if there are errors in the kubelet log:
   - `sudo systemctl restart kubelet`
   - In some cases, the node is completely unresponsive to any restart command. Proceed to step 3.
3. Reboot the entire Amazon EKS node. This can be done from the Amazon Elastic Compute Cloud* (Amazon EC2*) console. It typically takes ~5 minutes for the node to come back up and report to Kubernetes.
4. Escalate if the issue persists.
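The remediation steps above can also be driven from the command line. This is a sketch, assuming the AWS CLI and `kubectl` are configured; the pod, namespace, instance ID, and node names are placeholders, and the `node_is_ready` helper is illustrative only.

```shell
#!/usr/bin/env bash
# Restart a misbehaving container by deleting its pod so its controller
# recreates it (pod and namespace are placeholders):
#   kubectl delete pod <pod-name> -n <namespace>
#
# Reboot the node without using the EC2 console (instance ID is a placeholder):
#   aws ec2 reboot-instances --instance-ids <instance-id>
#
# Then wait for the node to rejoin and report Ready:
#   kubectl wait --for=condition=Ready node/<node-name> --timeout=10m

# Hypothetical helper: exit 0 if a single `kubectl get node --no-headers`
# line on stdin reports STATUS "Ready".
node_is_ready() {
  awk '{ exit !($2 == "Ready") }'
}
```

`kubectl wait` polls the node's `Ready` condition and exits non-zero on timeout, so it can gate follow-up automation after a reboot.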