cbtmasters

Reputation: 9

Uneven pod scheduling

I am running a Kubernetes Job, which launches pods based on the parallelism we define in the Job manifest. Currently, when I schedule the Job, the pods are spread unevenly across the nodes. I have a 3-node cluster with identical compute shapes and no workloads apart from this Job. Why is the k8s scheduler unable to spread the pods equally based on compute resource availability? My understanding is that the default scheduler (regardless of how the pod is created) should round-robin when all nodes have the same available resources.

Upvotes: 0

Views: 1121

Answers (2)

Crou

Reputation: 11446

Running Jobs in parallel has little to nothing to do with how they are scheduled.

Running in parallel means there will be "a few" pods running at the same time; it does not specify where they will be running.
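As a sketch (the name and values here are illustrative, not taken from the question), the `parallelism` field only caps how many Pods run at once; nothing in it tells the scheduler where to place them:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-demo        # hypothetical name
spec:
  parallelism: 3             # up to 3 Pods at a time -- a count, not a placement rule
  completions: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo working; sleep 10"]
```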

You can read more in the documentation on the Kubernetes Scheduler:

kube-scheduler is the default scheduler for Kubernetes and runs as part of the control plane. kube-scheduler is designed so that, if you want and need to, you can write your own scheduling component and use that instead.

For every newly created pod or other unscheduled pods, kube-scheduler selects an optimal node for them to run on. However, every container in pods has different requirements for resources and every pod also has different requirements. Therefore, existing nodes need to be filtered according to the specific scheduling requirements.

In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes. If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to place it.

The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes and picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.

Factors that need to be taken into account for scheduling decisions include individual and collective resource requirements, hardware / software / policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and so on.

...

kube-scheduler selects a node for the pod in a 2-step operation:

  1. Filtering
  2. Scoring

The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
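For instance (a sketch; the requests and limits below are illustrative, not taken from the question's manifest), declaring resource requests on a container gives the PodFitsResources filter something concrete to check:

```yaml
containers:
- name: pi
  image: busybox
  resources:
    requests:          # the scheduler filters and scores nodes against these
      cpu: "500m"
      memory: "128Mi"
    limits:            # limits are enforced at runtime, not used for filtering
      cpu: "1"
      memory: "256Mi"
```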

In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.

Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.

There are two supported ways to configure the filtering and scoring behavior of the scheduler:

  1. Scheduling Policies allow you to configure Predicates for filtering and Priorities for scoring.
  2. Scheduling Profiles allow you to configure Plugins that implement different scheduling stages, including: QueueSort, Filter, Score, Bind, Reserve, Permit, and others. You can also configure the kube-scheduler to run different profiles.
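As an illustration of the second option (a hedged sketch against the `kubescheduler.config.k8s.io` API; the strategy shown is one possibility, not a recommendation), a profile can tune how the NodeResourcesFit plugin scores nodes:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated     # favor nodes with more unallocated resources
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```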

You can also check the documentation for Scheduler Performance Tuning.

Upvotes: 2

cbtmasters

Reputation: 9

Here is my sample Job definition. I have a cluster with 1 master and 2 worker nodes in my demo environment. When I check the pods, I see the output shown after the manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout
  labels:
    app: test
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100
  parallelism: 15
  template:
    spec:
      containers:
      - name: pi
        image: busybox
        args:
          - /bin/sh
          - -c
          - date; echo sleeping....; sleep 90s; echo exiting...;
      nodeSelector:
        app: test
      restartPolicy: Never

NAME                              READY   STATUS            RESTARTS   AGE    IP             NODE    NOMINATED NODE   READINESS GATES
pi-with-timeout-6bmnk             1/2     Running           0          11s    10.24.2.47    um701   <none>           <none>
pi-with-timeout-6kxxt             2/2     Running           0          11s    10.244.2.46    um701   <none>           <none>
pi-with-timeout-7nt4l             0/2     PodInitializing   0          11s    10.244.1.151   um758   <none>           <none>
pi-with-timeout-dl5x6             0/2     PodInitializing   0          11s    10.244.1.145   um758   <none>           <none>
pi-with-timeout-gkldn             0/2     PodInitializing   0          11s    10.244.1.149   um758   <none>           <none>
pi-with-timeout-j7vp7             2/2     Running           0          11s    10.244.2.48    um701   <none>           <none>
pi-with-timeout-jzl72             1/2     Running           0          11s    10.244.1.144   um758   <none>           <none>
pi-with-timeout-jzzsz             2/2     Running           0          11s    10.244.1.146   um758   <none>           <none>
pi-with-timeout-kndrj             2/2     Running           0          11s    10.244.1.140   um758   <none>           <none>
pi-with-timeout-pl7mr             0/2     PodInitializing   0          11s    10.244.1.147   um758   <none>           <none>
pi-with-timeout-rjzxj             0/2     PodInitializing   0          11s    10.244.1.143   um758   <none>           <none>
pi-with-timeout-scccq             2/2     Running           0          11s    10.244.1.142   um758   <none>           <none>
pi-with-timeout-vj2jm             0/2     PodInitializing   0          11s    10.244.1.141   um758   <none>           <none>
pi-with-timeout-x9kt8             0/2     PodInitializing   0          11s    10.244.1.150   um758   <none>           <none>
pi-with-timeout-xsggq             0/2     PodInitializing   0          11s    10.244.1.148   um758   <none>           <none>

If you look at the output above, you will notice that worker node um701 has only 3 pods scheduled, whereas um758 has all the remaining pods.

Upvotes: 0
