Bhaumik Patel

Reputation: 15686

K8s Daemonset Pod Placement

We have a 4-node EKS cluster. Some pods that are part of a DaemonSet are stuck in Pending status because the node is full and has no capacity left to run them. Do we need to manually reshuffle the workloads to get the DaemonSet pods running in this situation, or is there a configuration that handles this automatically?

Note: we have also installed Cluster Autoscaler, which works perfectly for Deployments.

Thank you in advance.

Upvotes: 1

Views: 905

Answers (3)

Rahul Kumar

Reputation: 1

After spending quite some time observing all the cases in detail, the following should help in understanding the issue.

By design, DaemonSet pods are meant to run on all nodes (or on a group of nodes, depending on the tolerations specified in the DaemonSet manifest).

Since you have Cluster Autoscaler enabled, when a new node is spun up (either during a scale-up or when a new deployment is introduced), the DaemonSet pods are the first ones to get placed. (The pods are created by the DaemonSet controller and scheduled by the scheduler.) In your case, however, they are in Pending state. The likely reason is the priority of the DaemonSet pods: if their priority is lower than that of your workload pods, and the workload pod has preemptionPolicy set to PreemptLowerPriority, the workload pod will evict the DaemonSet pod. The DaemonSet pod then goes into Pending state because there were not enough resources for both, and since Cluster Autoscaler does not spawn new nodes for pending DaemonSet pods, it will remain Pending.

[screenshot: the pod and DaemonSet in your scenario]
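To make the preemption mechanism concrete, here is a minimal sketch (the class name, value and description are assumptions, not taken from the question) of a workload-side PriorityClass whose pods would preempt lower-priority DaemonSet pods:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: workload-priority                 # illustrative name
value: 100000                             # assumed to be higher than the DaemonSet pods' priority
preemptionPolicy: PreemptLowerPriority    # the default; pods with this class may evict lower-priority pods
globalDefault: false
description: "Example class for workload pods that are allowed to preempt DaemonSet pods."

A workload pod that references this class via priorityClassName: workload-priority and cannot be scheduled will cause the scheduler to evict lower-priority pods, including DaemonSet pods, from a node to make room.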

If the DaemonSet pods have to run on every node (which they should, because that is the purpose of defining them as a DaemonSet and not a Deployment), set their priority either higher than the workload pods or to system-node-critical. This ensures every node (or group of nodes) runs the DaemonSet pods.
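As a minimal sketch (the DaemonSet name, labels and image are placeholders, not from the question), the fix amounts to referencing a sufficiently high priority class in the DaemonSet's pod template:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                              # placeholder name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      priorityClassName: system-node-critical   # built-in high-priority class; or any class with a value above your workloads
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.9        # placeholder image

With system-node-critical (or any class whose value exceeds that of the workload pods), the scheduler will no longer preempt the DaemonSet pod in favour of workload pods.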

Along with this, a few other scenarios are worth knowing:

  • Scenario 1: DaemonSet pod priority is lower than the workload pod priority. The workload pod will evict the DaemonSet pod and the DaemonSet pod will go into Pending state. (This is your case, as explained above.)

  • Scenario 2: DaemonSet pod priority is higher than the workload pod priority. The DaemonSet pods will keep running on the nodes and will not get evicted by workload pods. This is the ideal scenario to follow.

But wait, what if the priority of the DaemonSet pod is equal to that of the workload pod? Let's discuss this in scenario 3.

  • Scenario 3: Let me explain this one in detail, practically. A scale-up is triggered for a new node because a workload pod needs a node to run.

case a) If, after placing the DaemonSet pod, there are enough resources left for the workload pod, it will get placed. Both will continue to run.

case b) There are not enough resources for the workload pod to get placed. In this case the workload pod will stay in Pending state forever: there is no node that can accommodate it, and in fact no new node will be spawned.

[screenshot: case b example]

case c) Now for the fun part. There are not enough resources for the workload pod to get placed after the DaemonSet pods are placed, and we manually scale up the cluster, i.e. increase the desired node count ourselves. The workload pod eventually gets running and the DaemonSet pods go into Pending state. Note that this happens only when the desired node count is changed manually, directly on the node group managed by the Cluster Autoscaler.

[screenshot: changing the desired node count values on the Cluster Autoscaler node group]

[screenshot: manually scaling up the cluster by increasing the node count - DaemonSet and workload pod example]

I hope this detailed, practical answer helps you understand the behaviour across multiple scenarios. Feel free to comment with any other cases you encounter; I will be happy to test them in a running environment as well.

Upvotes: 0

ishuar

Reputation: 1328

As those pods are part of a DaemonSet, they are expected to be scheduled on every node attached to the cluster, which means you have to make space for them on the nodes where they are failing to schedule.

If you have written that DaemonSet yourself, you can specify .spec.template.spec.nodeSelector, and the DaemonSet controller will create pods only on nodes that match that node selector. Likewise, if you specify .spec.template.spec.affinity, the DaemonSet controller will create pods only on nodes that match that node affinity. If you specify neither, the DaemonSet controller creates pods on all nodes, as per the official documentation. If the DaemonSet is third-party, check whether it already supports any of these scheduling options.
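For illustration, a minimal sketch (the name, labels and image are assumptions, not from the question) of a DaemonSet restricted to a subset of nodes via .spec.template.spec.nodeSelector could look like this:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector                    # placeholder name
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      nodeSelector:
        workload-type: logging           # assumed node label; pods are created only on matching nodes
      containers:
      - name: collector
        image: registry.k8s.io/pause:3.9 # placeholder image

With the nodeSelector in place, the DaemonSet controller only creates pods on nodes carrying that label, so nodes reserved for other workloads are skipped entirely.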

You can also think about increasing the node size, i.e. the instance type of the node group, but be careful with that: nodes are immutable, so they have to be replaced with the new instance type or with a new node group. For a complete answer on updating the node instance type, refer here.

Upvotes: 0

The Fool

Reputation: 20420

Kubernetes has pod priorities and preemption for this specific purpose.

Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make scheduling of the pending Pod possible.
ref: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/

If EKS does not have priority classes pre-configured, you can create one yourself, for example the one from the docs, which is a preempting one:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."

Then you use that class on your DaemonSet; the docs example below shows it on a plain Pod:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority # this is important

Note that this is just a small example copied from the linked docs; you should read the docs carefully and perhaps also review how this would interact with pod disruption budgets.
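For completeness, here is a minimal sketch (the DaemonSet name, labels and container are placeholders) of how the same field sits inside a DaemonSet's pod template:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-daemonset
spec:
  selector:
    matchLabels:
      app: example-daemonset
  template:
    metadata:
      labels:
        app: example-daemonset
    spec:
      priorityClassName: high-priority   # the PriorityClass created above
      containers:
      - name: main
        image: nginx
        imagePullPolicy: IfNotPresent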

Also note that this may cause disruption to other deployments, depending on various factors such as the update strategy. So be careful.

Upvotes: 2
