user482594

Reputation: 17486

How can I prevent horizontal pod autoscaler from taking out pods which are actively doing work?

I want to run a set of long-running tasks in batch using the Horizontal Pod Autoscaler. These tasks can take anywhere from a few minutes to a few hours to run, and they always use 80~100% of the available CPU resources.

I want to understand Autoscaler's behavior when it decides it is time to scale down the fleet.

  1. Let's say there are 4 instances that are all doing work and they are all at 95% CPU utilization. The fleet can no longer scale up because the maximum instance count is set to 4, and the scale-up threshold is set at 75% average CPU utilization.
  2. If 2 instances complete their work early, but the other 2 still have hours of work left, the average CPU utilization of the fleet can drop to 50%.
  3. The Autoscaler then decides it is time to scale down. However, 2 out of the 4 instances are still doing work, so there is a 50% chance that the Autoscaler selects a pod that is actively doing work and terminates it.
  4. If that happens, the progress of that work is lost and the task is marked as incomplete, and one of the remaining pods will fetch the task and start it again from the beginning.

Is there a way to prevent this from happening by prioritizing the pods with the lowest CPU utilization for scale-down? That way, the pods that are still processing work would be left untouched.
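
For context, my setup looks roughly like this (the names, the target Deployment and minReplicas are placeholders; only the replica limit and CPU target match the scenario above):

```yaml
# Illustrative only: names and the target Deployment are placeholders.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: batch-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker
  minReplicas: 1
  maxReplicas: 4                      # the fleet can never grow past 4 pods
  targetCPUUtilizationPercentage: 75  # scale out when average CPU goes above 75%
```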

Upvotes: 4

Views: 1608

Answers (1)

whites11

Reputation: 13260

I am not aware of a way to customize which replicas in a deployment should be deleted when scaling down the number of replicas.

Maybe you can solve your problem by setting terminationGracePeriodSeconds and using the preStop hook.

With terminationGracePeriodSeconds you can specify how long the containers in a pod are given between the first SIGTERM signal and the SIGKILL signal. This is suboptimal for you because, AFAIU, you don't know how long it will take the pod to complete its assigned tasks. But if you set this value high enough, you can leverage the preStop hook as well. From the documentation:

PreStop is called immediately before a container is terminated due to an API request or management event such as liveness/startup probe failure, preemption, resource contention, etc. The handler is not called if the container crashes or exits. The reason for termination is passed to the handler. The Pod's termination grace period countdown begins before the PreStop hook is executed. Regardless of the outcome of the handler, the container will eventually terminate within the Pod's termination grace period. Other management of the container blocks until the hook completes or until the termination grace period is reached.

If you are able, from within the container, to run a command that "blocks" until the container is finished working, then you should be able to make it terminate only when it's idle.
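
Something along these lines could work. This is just a sketch: the image, the idle-marker file and the grace period are placeholders, and it assumes your application can signal somehow (here, by touching a file) that it has no task in progress:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  # Allow up to 6 hours between the start of termination and SIGKILL.
  terminationGracePeriodSeconds: 21600
  containers:
  - name: worker
    image: my-batch-worker:latest     # placeholder image
    lifecycle:
      preStop:
        exec:
          # Block until the application reports it is idle, e.g. by
          # polling a marker file it creates when no task is running.
          command: ["/bin/sh", "-c", "while [ ! -f /tmp/idle ]; do sleep 5; done"]
```

Keep in mind that the grace period countdown includes the time spent in the preStop hook, so terminationGracePeriodSeconds has to be at least as long as your longest task.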

Let me also link a nice blog post explaining how the whole thing works: https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html

Upvotes: 4
