I am facing a weird issue with my pods. I launch around 20 pods in my environment, and every time a random 3-4 of them hang in Init:0/1 status. When I check the pod status, the init container shows as running (it should terminate once its task is finished) and the app container is stuck in Waiting/PodInitializing. The same init container image and spec are used across all 20 pods, but the issue hits different pods each time. When I delete the stuck pods, they get stuck in the Terminating state.
If I SSH into the node where the pod was scheduled and run docker ps, the init container is shown as running, but docker exec fails with an error saying the container doesn't exist. The init container pulls configs from a Consul server; when I checked the volume (path taken from docker inspect), I found that it had pulled all the key-value pairs correctly and saved them under the configured file name. I have checked resources on all the nodes, and more than enough is available on each.
Below is a detailed example of one pod behaving like this.
Kubectl Version :
kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.0", GitCommit:"925c127ec6b946659ad0fd596fa959be43f0cc05", GitTreeState:"clean", BuildDate:"2017-12-15T21:07:38Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Pods :
kubectl get pods -n dev1|grep -i session-service
session-service-app-75c9c8b5d9-dsmhp 0/1 Init:0/1 0 10h
session-service-app-75c9c8b5d9-vq98k 0/1 Terminating 0 11h
Pods Status :
kubectl describe pods session-service-app-75c9c8b5d9-dsmhp -n dev1
Name: session-service-app-75c9c8b5d9-dsmhp
Namespace: dev1
Node: ip-192-168-44-18.ap-southeast-1.compute.internal/
Start Time: Fri, 27 Apr 2018 18:14:43 +0530
Labels: app=session-service-app
Status: Pending
Controlled By: ReplicaSet/session-service-app-75c9c8b5d9
Init Containers:
Container ID: docker://c658d59995636e39c9d03b06e4973b6e32f818783a21ad292a2cf20d0e43bb02
Image ID: docker-pullable://
Port: <none>
State: Running
Started: Fri, 27 Apr 2018 18:14:44 +0530
Ready: False
Restart Count: 0
POD: sand
SERVICE: session-service-app
ENV: dev1
/var/lib/app from shared-volume-sidecar (rw)
/var/run/secrets/ from default-token-bthkv (ro)
Container ID:
Image ID:
Port: 8080/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
/etc/appenv from shared-volume-sidecar (rw)
/var/run/secrets/ from default-token-bthkv (ro)
Type Status
Initialized False
Ready False
PodScheduled True
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bthkv
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: for 300s for 300s
Events: <none>
Container Status on Node :
sudo docker ps|grep -i session
c658d5999563 "/usr/bin/consul-t..." 10 hours ago Up 10 hours k8s_initpullconsulconfig_session-service-app-75c9c8b5d9-dsmhp_dev1_c2075f2a-4a18-11e8-88e7-02929cc89ab6_0
da120abd3dbb "/pause" 10 hours ago Up 10 hours k8s_POD_session-service-app-75c9c8b5d9-dsmhp_dev1_c2075f2a-4a18-11e8-88e7-02929cc89ab6_0
f53d48c7d6ec "/usr/bin/consul-t..." 10 hours ago Up 10 hours k8s_initpullconsulconfig_session-service-app-75c9c8b5d9-vq98k_dev1_42837d12-4a12-11e8-88e7-02929cc89ab6_0
c26415458d39 "/pause" 10 hours ago Up 10 hours k8s_POD_session-service-app-75c9c8b5d9-vq98k_dev1_42837d12-4a12-11e8-88e7-02929cc89ab6_0
On running Docker exec (same result with kubectl exec) :
sudo docker exec -it c658d5999563 bash
rpc error: code = 2 desc = containerd: container not found
My problem was related to the ebs-csi-controller (AWS EKS 1.24).
The EBS add-on needs access to an IAM role, and in my case the role's trust relationship was broken. It uses OIDC, so I had to manually add my cluster's OIDC provider in the IAM identity provider section.
kubectl logs deployment/ebs-csi-controller -n kube-system -c ebs-plugin
helped diagnose this, as well as
A Pod can be stuck in Init status for many reasons.
PodInitializing or Init status means that the Pod contains an init container that hasn't finished (init containers are specialized containers that run before the app containers in a Pod; they can contain utilities or setup scripts). A pod status of Init:0/1 means that one init container has not finished; Init:N/M means the Pod has M init containers, and N have completed so far.
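As a small illustration of the Init:N/M notation, here is a minimal sketch of a POSIX shell function that splits the STATUS column into its two counts. The function name init_progress is made up for this example and is not part of kubectl.

```shell
# Hypothetical helper: interpret a STATUS value such as "Init:0/1"
# taken from the output of `kubectl get pods`.
init_progress() {
  status="$1"
  case "$status" in
    Init:[0-9]*/[0-9]*)
      counts="${status#Init:}"     # strip the "Init:" prefix, leaving "N/M"
      completed="${counts%%/*}"    # N: init containers completed so far
      total="${counts##*/}"        # M: init containers defined in the pod
      echo "$completed of $total init containers completed"
      ;;
    *)
      echo "not an Init:N/M status: $status"
      ;;
  esac
}

init_progress "Init:0/1"
```

For the pod in the question, this reports that none of its single init container has completed, which matches the init container still showing as Running.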
In these scenarios the best approach is to gather information first, because the root cause can be different in every PodInitializing issue.
kubectl describe pods pod-XXX
This command gives you a lot of information about the pod, and you can check whether there are any meaningful events as well. Note down the init container name.
kubectl logs pod-XXX
This command prints the logs for a container in a pod or specified resource.
kubectl logs pod-XXX -c init-container-xxx
This is the most precise option because it prints the logs of the init container itself. You can get the init container name by describing the pod and then use it in place of "init-container-XXX", for example "copy-default-config".
The output of kubectl logs pod-XXX -c init-container-xxx
can reveal meaningful information about the issue.
In that example, the root cause is that the init container can't download the plugins from Jenkins (timeout). From there we can check the connection config, proxy, and DNS, or just modify the YAML to deploy the container without the plugins.
kubectl describe node node-XXX
Describing the pod will give you the name of its node, which you can also inspect with this command.
kubectl get events
to list the cluster events.
journalctl -xeu kubelet | tail -n 10
prints the most recent kubelet logs on systemd hosts (journalctl -xeu docker | tail -n 1
does the same for Docker).
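The pod-level checks above can be bundled so they are easy to repeat for each stuck pod. The following is only a sketch: the function name diag_commands is invented for this example, and the pod, namespace, and init container names are placeholders that you supply. It prints the commands rather than running them, so you can review them first.

```shell
# Sketch only: print the diagnostic commands discussed above for one pod.
# Arguments (all placeholders chosen by the caller):
#   $1 = pod name, $2 = namespace, $3 = init container name
diag_commands() {
  pod="$1"
  ns="$2"
  init="$3"
  echo "kubectl describe pod $pod -n $ns"   # pod details and events
  echo "kubectl logs $pod -n $ns -c $init"  # logs of the init container
  echo "kubectl get events -n $ns"          # events in the pod's namespace
}

# Print the commands for review; append "| sh" to actually run them.
diag_commands pod-XXX default init-container-xxx
```

The node-level checks (kubectl describe node and journalctl) are left out on purpose, since the journalctl commands have to be run on the node itself.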
The solution depends on the information gathered and can only be chosen once the root cause is found.
When you find a log entry that hints at the root cause, you can investigate that specific cause.
Some examples:
1 > In this case the issue happened when the init container was deleted; it can be fixed by deleting the pod so that it is recreated, or by redeploying it. Same scenario in 1.1.
2 > If you find "bad address 'kube-dns.kube-system'", the PVC may not have been recycled correctly; the solution provided in 2 is running /opt/kubernetes/bin/
3 > There, an sh file was not found; the solution would be to modify the YAML file or remove the container if it is unnecessary.
4 > A FailedSync was found, and it was solved by restarting Docker on the node.
In general you can modify the YAML (for example, to avoid using an outdated URL), try to recreate the affected resource, or just remove the init container that causes the issue from your deployment. However, the right fix will depend on the specific root cause.
