Reputation: 11
I am running OCP 4.6 with RHEL 7.8 bare-metal compute nodes. We are running functionality and HA testing on the cluster. Our main application on this cluster is a StatefulSet with around 250 pods.
After shutting down a node, the pods running on that node entered a Terminating state and are stuck there.
Since this is a StatefulSet, pods cannot restart on another node until the original pod finishes terminating.
I can delete the pods with --force --grace-period=0, but this does not solve my issue.
These pods only terminate after the server that was shut down returns to Ready status.
Any ideas??
UPDATE:
Looking at the Kubernetes docs, I found that the fact that a StatefulSet pod doesn't terminate after its node shuts down is actually a safety mechanism, and is in fact a feature: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/
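For reference, the recovery procedure from the linked docs can be sketched as follows (node and pod names here are placeholders for your own; force deletion removes the pod object from the API server without waiting for kubelet confirmation, so only use it once you are sure the node is really down):

```shell
# Confirm the node is actually NotReady/down first.
kubectl get node worker-1

# Force-delete the stuck pod so the StatefulSet controller
# can recreate it on another node.
kubectl delete pod myapp-42 --force --grace-period=0
```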
Upvotes: 0
Views: 10517
Reputation: 2977
Maybe you can check whether your pod defines a finalizer. Sometimes a pod will not terminate because it is waiting for the finalizer action to finish, but the situation is such that the finalizer cannot run for whatever reason.
If so, you can try editing the pod and removing the finalizers section to see if the pod really goes away.
Of course, doing so may leave your apps in a bad state, depending on what the finalizer was supposed to do.
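A minimal sketch of checking and clearing finalizers (the pod name is hypothetical; clearing finalizers is a last resort, for the reasons above):

```shell
# Show any finalizers set on the pod's metadata.
kubectl get pod myapp-42 -o jsonpath='{.metadata.finalizers}'

# Remove them with a patch instead of editing by hand;
# the API server can then complete the deletion.
kubectl patch pod myapp-42 -p '{"metadata":{"finalizers":null}}'
```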
Upvotes: 0
Reputation: 13878
If you want to avoid Pods getting stuck when you shut down a Node, you should safely drain the Node first:
You can use kubectl drain to safely evict all of your pods from a node before you perform maintenance on the node (e.g. kernel upgrade, hardware maintenance, etc.). Safe evictions allow the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified.
When kubectl drain returns successfully, that indicates that all of the pods have been safely evicted (respecting the desired graceful termination period, and respecting the PodDisruptionBudget you have defined). It is then safe to bring down the node by powering down its physical machine or, if running on a cloud platform, deleting its virtual machine.
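The drain flow can be sketched like this (the node name is a placeholder; note that on older kubectl versions, including the one shipped with OCP 4.6 / Kubernetes 1.19, the emptyDir flag is spelled --delete-local-data instead):

```shell
# Drain before planned maintenance. --ignore-daemonsets is usually
# required since DaemonSet pods cannot be evicted;
# --delete-emptydir-data allows evicting pods that use emptyDir
# volumes (their data is lost).
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data

# ...perform the maintenance / power-cycle the node...

# Make the node schedulable again afterwards.
kubectl uncordon worker-1
```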
Also note that in the case of stuck evictions you can either:
Abort or pause the automated operation. Investigate the reason for the stuck application, and restart the automation.
After a suitably long wait, DELETE the Pod from your cluster's control plane instead of using the eviction API. Kubernetes does not specify what the behavior should be in this case; it is up to the application owners and cluster owners to establish an agreement on behavior in these cases.
To investigate the stuck Pods you can:
Check the Pod's logs with kubectl logs ${POD_NAME}.
Run kubectl describe pod ${POD_NAME} and check its Events section.
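Putting those investigation steps together (pod name is a placeholder for one of your stuck pods):

```shell
POD_NAME=myapp-42

# Current logs, plus logs from the previous container instance
# if it crashed or was restarted.
kubectl logs "$POD_NAME"
kubectl logs "$POD_NAME" --previous

# Full status, conditions, and the Events section.
kubectl describe pod "$POD_NAME"
```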
More details can be found in the linked docs.
Upvotes: 0