Debugging pods stuck in Init / ContainerCreating state

Hazmei | Tags: eks, aws, kubernetes
Here at Ascenda Loyalty, we are using AWS EKS to run our applications.

What is EKS?

EKS (Elastic Kubernetes Service) is a managed Kubernetes service offered by AWS. AWS manages the control plane of the Kubernetes cluster while you manage the data plane.

Some background info.

We use security groups for pods for our application pods, which run on m5a.xlarge EC2 instances. This lets us maintain a strict security boundary between the pods and AWS resources (i.e. RDS, ElastiCache). Security groups for pods are supported only on most Nitro-based Amazon EC2 instance families, and they come with a lower per-node pod limit (only 18 pods can use security groups on an m5a.xlarge instance).

What happened?

Pods were randomly getting stuck in the Init / ContainerCreating state for more than 10 minutes after we increased the number of pod replicas.

Figure 1. kubectl get pods -o wide -A output

So... What gives?
The pods are unable to obtain a private IPv4 address from the CNI, so they stay in the Init / ContainerCreating state until an IP is assigned. We can rule out a scheduling issue, as the pods were successfully scheduled onto nodes.
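One way to spot this symptom without eyeballing the pod list is to filter for pods that already have a node but no IP. The snippet below is a minimal sketch, not what we actually ran: it assumes the input is the parsed output of kubectl get pods -A -o json, and the function name, the 10 minute threshold and the sample data are all made up for illustration.

```python
from datetime import datetime, timezone


def pods_waiting_for_ip(pod_list, now, min_age_seconds=600):
    """List scheduled pods that are still Pending and have no pod IP.

    pod_list: parsed JSON from `kubectl get pods -A -o json`.
    min_age_seconds: hypothetical threshold (10 minutes by default).
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        node = pod.get("spec", {}).get("nodeName")
        status = pod.get("status", {})
        if not node:
            continue  # never scheduled: a scheduling problem, not an IP problem
        if status.get("phase") != "Pending" or status.get("podIP"):
            continue  # already running or finished, or an IP was assigned
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if (now - created).total_seconds() >= min_age_seconds:
            stuck.append((meta["namespace"], meta["name"], node))
    return stuck


# Made-up sample data shaped like a Kubernetes PodList.
sample_pods = {
    "items": [
        {"metadata": {"namespace": "default", "name": "webapp-1",
                      "creationTimestamp": "2021-09-01T10:00:00Z"},
         "spec": {"nodeName": "node-a"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "default", "name": "webapp-2",
                      "creationTimestamp": "2021-09-01T10:00:00Z"},
         "spec": {"nodeName": "node-a"},
         "status": {"phase": "Running", "podIP": "10.0.1.12"}},
        {"metadata": {"namespace": "default", "name": "webapp-3",
                      "creationTimestamp": "2021-09-01T10:12:00Z"},
         "spec": {"nodeName": "node-a"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "default", "name": "worker-1",
                      "creationTimestamp": "2021-09-01T10:00:00Z"},
         "spec": {},
         "status": {"phase": "Pending"}},
    ]
}
check_time = datetime(2021, 9, 1, 10, 15, tzinfo=timezone.utc)
print(pods_waiting_for_ip(sample_pods, check_time))
```

With the sample data only webapp-1 is reported: webapp-2 has an IP, webapp-3 is younger than the threshold, and worker-1 was never scheduled. To run it against a real cluster, the JSON could be piped in and loaded with json.load(sys.stdin).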

The first thing that comes to mind is to check whether the private IPv4 addresses in the subnets are fully exhausted.

Figure 2. Screenshot of the available IPv4 address in the subnets

This is not the case, as shown in Figure 2.
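The same subnet check can be scripted instead of read off the console. This is a rough sketch that assumes the input is the parsed JSON from aws ec2 describe-subnets, which reports an AvailableIpAddressCount per subnet; the function name, the threshold of 16 and the subnet IDs below are hypothetical.

```python
def low_ip_subnets(describe_subnets_output, threshold=16):
    """Return (subnet id, free IP count) for subnets below `threshold`.

    describe_subnets_output: parsed JSON from `aws ec2 describe-subnets`.
    threshold: hypothetical cut-off for "nearly exhausted".
    """
    low = []
    for subnet in describe_subnets_output.get("Subnets", []):
        available = subnet["AvailableIpAddressCount"]
        if available < threshold:
            low.append((subnet["SubnetId"], available))
    # Most exhausted subnet first.
    return sorted(low, key=lambda item: item[1])


# Made-up sample data; the subnet IDs and CIDR blocks are placeholders.
sample_subnets = {
    "Subnets": [
        {"SubnetId": "subnet-aaa", "CidrBlock": "10.0.0.0/24",
         "AvailableIpAddressCount": 180},
        {"SubnetId": "subnet-bbb", "CidrBlock": "10.0.1.0/24",
         "AvailableIpAddressCount": 3},
    ]
}
print(low_ip_subnets(sample_subnets))
```

An empty result means no subnet is close to exhaustion, which matches what we saw in our case.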

The next thing that comes to mind is that the branch network interfaces (pod ENIs) are at their limit on the affected worker nodes.

It doesn't seem that we've maxed out our usage of branch ENIs. 🤔

Let's dig a little further elsewhere, since this is related to the pods not getting an IP address. One thing that came to mind was the AWS CNI that we use. The version in use at the time was 1.7.10. There might have been a bug in the deployed version that caused these random failures.

A quick Google search brought us here. Most of the solutions point to upgrading the AWS CNI to version ≥ v1.7.7 (which we were already on). There were also comments stating that certain environment variables need to be set to use security groups for pods (which we had done correctly). Newer AWS CNI releases were available at the time, the latest being v1.9.0, and with no other options left, we upgraded to the latest CNI version.

Everything seemed fine for a few hours, until the same error returned to haunt us.

Fast forward

After opening an AWS support ticket and going back and forth with the AWS engineer, we found that it was indeed due to the pod ENI limit. Our usage of security groups for pods was ultimately causing this error: failed to assign an IP address to container.

Although using security groups for pods in EKS has a drawback (fewer pods per node), we still use it to maintain a high level of security between different AWS resources such as RDS and ElastiCache.

Why didn’t we notice that we ran out of pod ENIs in the first place?

For each application, we deploy a Kubernetes job that runs a DB migration step before deploying a set of webapp and worker pods. All of these consume pod ENIs, as they use security groups for pods.

When we first checked whether we were hitting the pod ENI limit, we ran these commands:

Upon further inspection of the output from kubectl describe nodes <node-name>, there’s a discrepancy between the reported allocated resources for vpc.amazonaws.com/pod-eni and the number of pods that use a pod ENI. We can verify this by running kubectl get pods -o wide -A | grep <node-name> and counting the pods that use security groups for pods.

Figure 3. List of pods scheduled in affected node

Counting all the pods that use security groups in Figure 3, we are way past the pod ENI limit of 18. Comparing that count with the allocated pod-eni figure reported by kubectl describe node <node-name> shows the discrepancy, which can be anywhere from 1 to 6 pods.

What’s causing this discrepancy?

It’s the DB migration jobs. These run as Kubernetes jobs and use security groups for pods. On completion, the allocated pod ENI does not get detached, and this is not reflected in the output of kubectl describe node <node-name>. That command only counts running pods and does not include completed pods.
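To make the hidden consumers visible, the completed pods have to be counted separately from the running ones. The sketch below does that for a single node. It assumes the input is the parsed output of kubectl get pods -A -o json, and that pods using security groups carry a vpc.amazonaws.com/pod-eni entry in their container resource limits; the function names and the sample data are hypothetical.

```python
POD_ENI = "vpc.amazonaws.com/pod-eni"


def requests_pod_eni(pod):
    """True if any container in the pod has a pod-eni resource limit."""
    for container in pod["spec"].get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if int(limits.get(POD_ENI, 0)) > 0:
            return True
    return False


def pod_eni_usage(pod_list, node_name):
    """Return (active, completed) counts of pod-eni pods on one node.

    Completed pods (phase Succeeded or Failed) are the ones that
    `kubectl describe node` leaves out of its allocated resources.
    """
    active = completed = 0
    for pod in pod_list.get("items", []):
        if pod["spec"].get("nodeName") != node_name:
            continue
        if not requests_pod_eni(pod):
            continue
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            completed += 1
        else:
            active += 1
    return active, completed


def make_pod(name, node, phase, uses_eni):
    """Build a minimal made-up pod object for the example."""
    limits = {POD_ENI: "1"} if uses_eni else {}
    return {
        "metadata": {"namespace": "default", "name": name},
        "spec": {"nodeName": node,
                 "containers": [{"name": "main",
                                 "resources": {"limits": limits}}]},
        "status": {"phase": phase},
    }


sample_pods = {
    "items": [
        make_pod("webapp-1", "node-a", "Running", True),
        make_pod("worker-1", "node-a", "Running", True),
        make_pod("db-migrate-1", "node-a", "Succeeded", True),
        make_pod("kube-proxy-1", "node-a", "Running", False),
        make_pod("webapp-2", "node-b", "Running", True),
    ]
}
active, completed = pod_eni_usage(sample_pods, "node-a")
print(f"active={active} completed={completed} total={active + completed}")
```

In the sample, node-a has two active pod-eni pods plus one completed migration pod, so three pod ENIs are effectively in use even though only two would show up as allocated.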

What now?

These are some of the possible solutions:

  1. Specify .spec.ttlSecondsAfterFinished in the job manifest.
    Not possible for us at the moment. This feature is still in alpha in Kubernetes v1.19, and EKS does not enable pre-beta features.
  2. ☑️ Set the CI/CD system to delete the Kubernetes job after it completes successfully.
    This is the most suitable solution for us. We can remove the successful job, since keeping it around serves no purpose and consumes one pod ENI per pod.
  3. Run the DB migration job as an init container of the webapp pod.
    This would free up one pod ENI per application, since the migration would run in the same pod. However, it requires a fair bit of work on our CI/CD and helm charts, and we would be uncertain about the impact on the system.
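Option 2 boils down to finding jobs that have completed and deleting them. The following is only a sketch of that selection step, not our actual pipeline code: it assumes the input is the parsed output of kubectl get jobs -A -o json and relies on the job's Complete condition; the function names and job names are hypothetical, and the commands are built but not executed.

```python
def completed_jobs(job_list):
    """Return (namespace, name) for jobs whose Complete condition is True."""
    done = []
    for job in job_list.get("items", []):
        conditions = job.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Complete" and c.get("status") == "True"
               for c in conditions):
            done.append((job["metadata"]["namespace"], job["metadata"]["name"]))
    return done


def delete_commands(jobs):
    """Build one `kubectl delete job` argument list per completed job."""
    return [["kubectl", "delete", "job", name, "--namespace", namespace]
            for namespace, name in jobs]


# Made-up sample data shaped like a Kubernetes JobList.
sample_jobs = {
    "items": [
        {"metadata": {"namespace": "default", "name": "db-migrate-app-a"},
         "status": {"succeeded": 1,
                    "conditions": [{"type": "Complete", "status": "True"}]}},
        {"metadata": {"namespace": "default", "name": "db-migrate-app-b"},
         "status": {"active": 1}},
        {"metadata": {"namespace": "default", "name": "db-migrate-app-c"},
         "status": {"failed": 1,
                    "conditions": [{"type": "Failed", "status": "True"}]}},
    ]
}
for command in delete_commands(completed_jobs(sample_jobs)):
    print(" ".join(command))
```

Only db-migrate-app-a is selected here. Failed jobs are deliberately left alone so they can still be inspected; each argument list could be handed to subprocess.run in a pipeline step.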

Update

27 September 2021
We have since updated the CI/CD pipeline to remove the Kubernetes job on completion, and it has been running for a week without any incident, even with the increased number of pods in the cluster! 🎉

© Hazmei Bin Abdul Rahman.