Howard_Roark

Reputation: 4326

Kubernetes: How To Ensure That Pod A Only Ends Up On Nodes Where Pod B Is Running

I have two use cases where teams only want Pod A to end up on a Node where Pod B is running. They often have many copies of Pod B running on a Node, but they only want one copy of Pod A running on that same Node.

Currently they are using DaemonSets to manage Pod A, which is not effective because Pod A then ends up on a lot of nodes where Pod B is not running. I would prefer not to restrict the nodes they can end up on with labels, because that would limit the Node capacity for Pod B (i.e., if we have 100 nodes and 20 are labeled, then Pod B's possible capacity is only 20).

In short, how can I ensure that one copy of Pod A runs on any Node with at least one copy of Pod B running?

Upvotes: 0

Views: 1552

Answers (3)

VAS

Reputation: 9031

As I understand it, you have a Kubernetes cluster with N nodes and some pods of type B scheduled across them. You want exactly one pod of type A on every node where at least one pod of type B is running. In other words, A <= N and A <= B, while B itself may be greater or smaller than N (read <= as "less than or equal").

You are currently using a DaemonSet controller to schedule Pod A, and it doesn't work the way you want. You can fix that by forcing the DaemonSet pods to be scheduled by the default scheduler instead of by the DaemonSet controller, which schedules its pods without considering pod priority and preemption.

ScheduleDaemonSetPods allows you to schedule DaemonSets using the default scheduler instead of the DaemonSet controller, by adding the NodeAffinity term to the DaemonSet pods, instead of the .spec.nodeName term. The default scheduler is then used to bind the pod to the target host. If node affinity of the DaemonSet pod already exists, it is replaced. The DaemonSet controller only performs these operations when creating or modifying DaemonSet pods, and no changes are made to the spec.template of the DaemonSet.
In addition, node.kubernetes.io/unschedulable:NoSchedule toleration is added automatically to DaemonSet Pods. The default scheduler ignores unschedulable Nodes when scheduling DaemonSet Pods.

So if we add podAffinity/podAntiAffinity to a DaemonSet, N replicas (one per node) will be created, but pods will only be scheduled on the nodes that match the (anti)affinity condition; the rest will remain in the Pending state.

Here is an example of such a DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-splunk-sidecar
  namespace: default
  labels:
    k8s-app: ds-splunk-sidecar
spec:
  selector:
    matchLabels:
      name: ds-splunk-sidecar
  template:
    metadata:
      labels:
        name: ds-splunk-sidecar
    spec:
      affinity:
#        podAntiAffinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - splunk-app
            topologyKey: "kubernetes.io/hostname"
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: ds-splunk-sidecar                                                                                                                          
        image: nginx
      terminationGracePeriodSeconds: 30

The output of kubectl get pods -o wide | grep splunk:

ds-splunk-sidecar-26cpt          0/1     Pending     0          4s     <none>         <none>         <none>           <none>
ds-splunk-sidecar-8qvpx          1/1     Running     0          4s     10.244.2.87    kube-node2-2   <none>           <none>
ds-splunk-sidecar-gzn7l          0/1     Pending     0          4s     <none>         <none>         <none>           <none>
ds-splunk-sidecar-ls56g          0/1     Pending     0          4s     <none>         <none>         <none>           <none>
splunk-7d65dfdc99-nz6nz          1/2     Running     0          2d     10.244.2.16    kube-node2-2   <none>           <none>

The output of kubectl get pod ds-splunk-sidecar-26cpt -o yaml (one of the pods in the Pending state). The nodeAffinity section is added to pod.spec automatically, without affecting the parent DaemonSet configuration:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-04-02T13:10:23Z"
  generateName: ds-splunk-sidecar-
  labels:
    controller-revision-hash: 77bfdfc748
    name: ds-splunk-sidecar
    pod-template-generation: "1"
  name: ds-splunk-sidecar-26cpt
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: ds-splunk-sidecar
    uid: 4fda6743-74e3-11ea-8141-42010a9c0004
  resourceVersion: "60026611"
  selfLink: /api/v1/namespaces/default/pods/ds-splunk-sidecar-26cpt
  uid: 4fdf96d5-74e3-11ea-8141-42010a9c0004
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - kube-node2-1
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - splunk-app
        topologyKey: kubernetes.io/hostname
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: ds-splunk-sidecar
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-mxvh9
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30  
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - name: default-token-mxvh9
    secret:
      defaultMode: 420
      secretName: default-token-mxvh9
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-04-02T13:10:23Z"
    message: '0/4 nodes are available: 1 node(s) didn''t match pod affinity rules,
      1 node(s) didn''t match pod affinity/anti-affinity, 3 node(s) didn''t match
      node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

Alternatively, you can achieve similar results using a Deployment controller:

Since Deployments can only be autoscaled based on pod metrics (unless you write your own HPA), the number of Pod A replicas has to be set to N manually (a one-liner to keep it in sync is sketched after the example below). If a node has no Pod B at all, one Pod A replica will stay in the Pending state.

There is an almost exact example of the setup described in the question, using the requiredDuringSchedulingIgnoredDuringExecution directive. Please see the section "More Practical Use-cases: Always co-located in the same node" of the "Assigning Pods to Nodes" documentation page:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: depl-a              # object names must be lowercase (DNS-1123)
spec:
  selector:
    matchLabels:
      app: depl-a
  replicas: N               # <---- N = number of nodes in the cluster (N <= replicas of depl-b)
  template:
    metadata:
      labels:
        app: depl-a
    spec:
      affinity:
        podAntiAffinity:    # prevents scheduling more than one Pod A on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - depl-a
            topologyKey: "kubernetes.io/hostname"
        podAffinity:        # ensures that Pod A is scheduled only on nodes where a Pod B is present
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - depl-b
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine
There is one problem, common to both approaches: if Pod B is rescheduled to a different node for any reason and no Pod B remains on a node, the Pod A on that node will not be evicted automatically.

That problem could be solved by scheduling a CronJob with a kubectl image and a proper service account that, every ~5 minutes, deletes every Pod A running on a node with no corresponding Pod B. A rough sketch of such a CronJob is shown below.
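
This is only a sketch, not a tested solution. It assumes that Pod A is labeled app=pod-a, Pod B is labeled app=pod-b, both run in the default namespace, a ServiceAccount named pod-a-reaper with RBAC permissions to list and delete pods already exists, and the cluster supports batch/v1 CronJobs (Kubernetes 1.21+); all of these names are placeholders.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-a-reaper        # placeholder name
  namespace: default
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-a-reaper   # must be allowed to list and delete pods
          restartPolicy: OnFailure
          containers:
          - name: reaper
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # For every running Pod A, check whether at least one running Pod B shares its node.
              for entry in $(kubectl get pods -l app=pod-a --field-selector status.phase=Running \
                  -o jsonpath='{range .items[*]}{.metadata.name}:{.spec.nodeName}{"\n"}{end}'); do
                pod=${entry%%:*}
                node=${entry##*:}
                b_count=$(kubectl get pods -l app=pod-b \
                  --field-selector spec.nodeName=$node,status.phase=Running -o name | wc -l)
                if [ "$b_count" -eq 0 ]; then
                  echo "No Pod B on node $node, deleting $pod"
                  kubectl delete pod "$pod"
                fi
              done

The script relies only on standard kubectl label and field selectors (spec.nodeName, status.phase), so nothing beyond kubectl and RBAC is required; adjust the labels and the schedule to your environment.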

Upvotes: 0

kool

Reputation: 3613

As already explained by coderanger, the current scheduler doesn't support this function. The ideal solution would be to write your own scheduler that supports such functionality.

However, you can use podAffinity to partially achieve this by asking the scheduler to place the pods on the same node:

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - <your_value>
          topologyKey: "kubernetes.io/hostname"

The scheduler will then try to co-locate the pods as tightly as possible, but since this is a preference rather than a hard requirement, it is not guaranteed.

Upvotes: 0

coderanger

Reputation: 54191

The current scheduler doesn’t really have anything like this. You would need to write something yourself.

Upvotes: 1
