Reputation: 2175
I want to run multiple jobs in a kubernetes cluster, but the total resource requirements exceed the size of the cluster, and the requirements of one job span multiple nodes. How do I avoid a livelock where all jobs have some resources, but none have enough to complete?
For example, suppose I have 4 nodes, each with 1 GB of memory available. I want to submit 2 jobs, each of which requires 3 GB of memory to complete, split across 3 pods that each require 1 GB. The correct solution here is to run the jobs sequentially; how do I ensure that happens?
I want to avoid the situation where both jobs schedule two pods each, using up the entire cluster, while the remaining pod of each job is stuck in the Pending state because no more resources are available. Since neither job can complete with only 2 GB of memory, the system can no longer make progress.
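For concreteness, one of the jobs might look like the sketch below (the name, image, and command are placeholders). Submit two of these into the 4-node/4 GB cluster and each can get two pods scheduled before both wedge:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-a                  # placeholder name
    spec:
      completions: 3               # all 3 pods must finish
      parallelism: 3               # run them concurrently
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: busybox                       # placeholder image
            command: ["sh", "-c", "sleep 60"]    # placeholder workload
            resources:
              requests:
                memory: 1Gi        # each pod claims a full node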
Some features I've looked at that don't seem to be suitable:
It looks like a custom scheduler is needed. kube-batch looks like a possible solution, as its PodGroup resource supports a minMember attribute for gang scheduling (see the sketch below). I will test this and post it as a self-answer, unless anyone can chime in with more detail.
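From what I can tell, it would look roughly like this (untested; the API group and annotation names are taken from the kube-batch README and may differ between releases). The PodGroup's minMember: 3 should keep the scheduler from placing any of the job's pods until all 3 fit:

    apiVersion: scheduling.incubator.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: job-a-group
    spec:
      minMember: 3                 # schedule all 3 pods or none
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-a
    spec:
      completions: 3
      parallelism: 3
      template:
        metadata:
          annotations:
            scheduling.k8s.io/group-name: job-a-group   # join the PodGroup
        spec:
          schedulerName: kube-batch    # hand pods to the kube-batch scheduler
          restartPolicy: Never
          containers:
          - name: worker
            image: busybox
            resources:
              requests:
                memory: 1Gi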
Upvotes: 0
Views: 462
Reputation: 2196
The easy solution is to assign each job a PriorityClass so that one job can preempt the other if needed:
https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
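For example (the class name and value are placeholders), you'd define a PriorityClass and reference it from the pod template of the job that should win:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: important-job          # placeholder name
    value: 1000000                 # higher values preempt lower ones
    globalDefault: false
    ---
    # Then, in the preferred Job's pod template:
    # spec:
    #   template:
    #     spec:
    #       priorityClassName: important-job

The scheduler evicts enough lower-priority pods to make room for the higher-priority job's pending pods, so that job always runs to completion first.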
However, this means one job will always have priority over the other. If you need them to run in the order they were received, you need a job queueing system. Here is one you can try:
https://github.com/kubernetes-sigs/kueue
Using kueue, you would create a Workload for each job as it comes in and add them all to the same LocalQueue.
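As a rough sketch (untested, based on the kueue docs; API versions and field names may have changed), you give the cluster's capacity to a ClusterQueue, point a LocalQueue at it, and label each Job with the queue name. Kueue then suspends incoming jobs, creates a Workload for each, and admits them only when quota is free:

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: default-flavor
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: cluster-queue
    spec:
      namespaceSelector: {}            # admit workloads from any namespace
      resourceGroups:
      - coveredResources: ["memory"]
        flavors:
        - name: default-flavor
          resources:
          - name: memory
            nominalQuota: 4Gi          # total capacity in the example above
    ---
    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      name: user-queue
      namespace: default
    spec:
      clusterQueue: cluster-queue
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-a
      labels:
        kueue.x-k8s.io/queue-name: user-queue   # enqueue this job
    spec:
      suspend: true                    # kueue flips this when it admits the job
      completions: 3
      parallelism: 3
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: busybox
            resources:
              requests:
                memory: 1Gi

With a 4Gi quota, only one 3Gi job can be admitted at a time; the second stays queued until the first finishes, which avoids the livelock in the question.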
Upvotes: 1