Linux Devops

Reputation: 11

Pods stuck on "Terminating" after compute-node shutdown

I am running OCP 4.6 with RHEL 7.8 bare-metal compute nodes. We are running functionality and HA testing on the cluster. Our main application on this cluster is a StatefulSet with around 250 pods.

After shutting down a node, the pods running on it entered a Terminating state and are stuck there. Because this is a StatefulSet, the pods cannot be restarted on another node until the original pod finishes terminating.

I can delete the pods with --force --grace-period=0 but this does not solve my issue.

These pods only terminate after the server that was shut down returns to a Ready status.

Any ideas??

UPDATE:

Looking at the Kubernetes docs, I found that a StatefulSet pod not terminating after a node shuts down is actually a safety mechanism, and is in fact a feature: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/
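For reference, the force deletion those docs describe looks like this (the pod name `web-0` is a placeholder for one of your stuck pods):

```shell
# Force-delete a StatefulSet pod stuck in Terminating after a node failure.
# WARNING: only do this when you are certain the pod is no longer running
# anywhere, otherwise two pods with the same StatefulSet identity may run
# at the same time.
kubectl delete pod web-0 --grace-period=0 --force
```

The safety mechanism exists precisely because the control plane cannot tell whether the pod on the unreachable node is dead or merely partitioned.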

Upvotes: 0

Views: 10517

Answers (2)

titou10

Reputation: 2977

Maybe you can check whether your pod defines a "finalizer". Sometimes a pod will not terminate because it is waiting for the finalizer action to finish, but the situation is such that the finalizer cannot run for whatever reason.

If so, you can try editing the pod and removing the "finalizers" section to see if the pod really goes away.

Of course, doing so may leave your application in a bad state, depending on what the finalizer was supposed to do.
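A minimal sketch of checking for and clearing finalizers (the pod name `web-0` is a placeholder):

```shell
# Show any finalizers defined on the stuck pod (empty output means none).
kubectl get pod web-0 -o jsonpath='{.metadata.finalizers}'

# Clear the finalizers so the deletion can complete.
# Caution: this skips whatever cleanup the finalizer was meant to perform.
kubectl patch pod web-0 -p '{"metadata":{"finalizers":null}}'
```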


Upvotes: 0

Wytrzymały Wiktor

Reputation: 13878

If you want to avoid Pods being stuck when you shut down your Node, you should try to Safely Drain a Node first:

You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.

When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted (respecting the desired graceful termination period, and respecting the PodDisruptionBudget you have defined). It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
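The drain workflow quoted above looks like this in practice (the node name `worker-1` is a placeholder):

```shell
# Evict all pods from the node, respecting PodDisruptionBudgets.
# --ignore-daemonsets is usually needed because DaemonSet pods cannot
# be evicted; --delete-emptydir-data permits evicting pods that use
# emptyDir volumes (their local data is lost).
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# ...perform the maintenance, then make the node schedulable again:
kubectl uncordon worker-1
```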

Also note that in case of Stuck evictions:

  • Abort or pause the automated operation. Investigate the reason for the stuck application, and restart the automation.

  • After a suitably long wait, delete the Pod from your cluster's control plane, instead of using the eviction API.

Kubernetes does not specify what the behavior should be in this case; it is up to the application owners and cluster owners to establish an agreement on behavior in these cases.

In order to investigate the stuck Pods you can:

  • Check the Pods' logs with kubectl logs ${POD_NAME}

  • kubectl describe pod and check its Events section

  • Debug with container exec
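Put together, the investigation steps above look like this (the pod name `web-0` is a placeholder):

```shell
# Inspect the pod's logs; add --previous if the container has restarted.
kubectl logs web-0

# Check the Events section at the bottom for scheduling or
# termination errors.
kubectl describe pod web-0

# Open a shell inside a running container to debug interactively.
kubectl exec -it web-0 -- /bin/sh
```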

More details can be found in the linked docs.

Upvotes: 0
