Reputation: 4326
I have two use cases where teams only want Pod A to end up on a Node where Pod B is running. They often have many copies of Pod B running on a Node, but they only want one copy of Pod A running on that same Node.
Currently they are using DaemonSets to manage Pod A, which is not effective because Pod A then ends up on a lot of nodes where Pod B is not running. I would prefer not to restrict the nodes they can end up on with labels, because that would limit the Node capacity for Pod B (i.e., if we have 100 nodes and 20 are labeled, then Pod B's possible capacity is only 20).
In short, how can I ensure that one copy of Pod A runs on any Node with at least one copy of Pod B running?
Upvotes: 0
Views: 1552
Reputation: 9031
As per my understanding, you have a Kubernetes cluster with N nodes and some Pods of type B scheduled on it. Now you want exactly one Pod of type A on each node where at least one Pod of type B is scheduled. I assume that A <= N and A <= B (read <= as "less than or equal"); B itself may be larger or smaller than N.
At the moment you are using a DaemonSet controller to schedule the Pods A, and it doesn't work the way you want. You can fix that by forcing the DaemonSet to be scheduled by the default scheduler instead of the DaemonSet controller, which schedules its Pods without considering pod priority and preemption.
ScheduleDaemonSetPods allows you to schedule DaemonSets using the default scheduler instead of the DaemonSet controller, by adding the NodeAffinity term to the DaemonSet pods, instead of the .spec.nodeName term. The default scheduler is then used to bind the pod to the target host. If node affinity of the DaemonSet pod already exists, it is replaced. The DaemonSet controller only performs these operations when creating or modifying DaemonSet pods, and no changes are made to the spec.template of the DaemonSet.
In addition, the node.kubernetes.io/unschedulable:NoSchedule toleration is added automatically to DaemonSet Pods. The default scheduler ignores unschedulable Nodes when scheduling DaemonSet Pods.
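For reference, the nodeAffinity term injected by the DaemonSet controller into each of its Pods looks roughly like this (the node name below is only a placeholder; the real term is visible in the Pod dump further down):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - <target-node-name>   # placeholder for the node chosen for this Pod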
So if we add podAffinity/podAntiAffinity to a DaemonSet, N replicas (N = number of nodes) will be created, but the Pods will only be scheduled on the nodes that match the (anti-)affinity condition; the rest of the Pods will remain in the Pending state.
Here is an example of such a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-splunk-sidecar
  namespace: default
  labels:
    k8s-app: ds-splunk-sidecar
spec:
  selector:
    matchLabels:
      name: ds-splunk-sidecar
  template:
    metadata:
      labels:
        name: ds-splunk-sidecar
    spec:
      affinity:
        # podAntiAffinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - splunk-app
            topologyKey: "kubernetes.io/hostname"
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: ds-splunk-sidecar
        image: nginx
      terminationGracePeriodSeconds: 30
The output of kubectl get pods -o wide | grep splunk:
ds-splunk-sidecar-26cpt 0/1 Pending 0 4s <none> <none> <none> <none>
ds-splunk-sidecar-8qvpx 1/1 Running 0 4s 10.244.2.87 kube-node2-2 <none> <none>
ds-splunk-sidecar-gzn7l 0/1 Pending 0 4s <none> <none> <none> <none>
ds-splunk-sidecar-ls56g 0/1 Pending 0 4s <none> <none> <none> <none>
splunk-7d65dfdc99-nz6nz 1/2 Running 0 2d 10.244.2.16 kube-node2-2 <none> <none>
The output of kubectl get pod ds-splunk-sidecar-26cpt -o yaml (which is in the Pending state). The nodeAffinity section is automatically added to pod.spec without affecting the parent DaemonSet configuration:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-04-02T13:10:23Z"
  generateName: ds-splunk-sidecar-
  labels:
    controller-revision-hash: 77bfdfc748
    name: ds-splunk-sidecar
    pod-template-generation: "1"
  name: ds-splunk-sidecar-26cpt
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: ds-splunk-sidecar
    uid: 4fda6743-74e3-11ea-8141-42010a9c0004
  resourceVersion: "60026611"
  selfLink: /api/v1/namespaces/default/pods/ds-splunk-sidecar-26cpt
  uid: 4fdf96d5-74e3-11ea-8141-42010a9c0004
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - kube-node2-1
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - splunk-app
        topologyKey: kubernetes.io/hostname
  containers:
  - image: nginx
    imagePullPolicy: Always
    name: ds-splunk-sidecar
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-mxvh9
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - name: default-token-mxvh9
    secret:
      defaultMode: 420
      secretName: default-token-mxvh9
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-04-02T13:10:23Z"
    message: '0/4 nodes are available: 1 node(s) didn''t match pod affinity rules,
      1 node(s) didn''t match pod affinity/anti-affinity, 3 node(s) didn''t match
      node selector.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort
Alternatively, you can achieve similar results using a Deployment:
Since Deployments can only be autoscaled based on Pod metrics (unless you write your own HPA), we have to set the number of A replicas equal to N manually. If there is a node without any Pod B, one Pod A will stay in the Pending state.
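A minimal way to keep the replica count in line with the node count is a one-liner like the one below; the Deployment name deplA is taken from the example manifest that follows and is only illustrative:

kubectl scale deployment deplA --replicas="$(kubectl get nodes --no-headers | wc -l)"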
There is an almost exact example of the setup described in the question using the requiredDuringSchedulingIgnoredDuringExecution directive; see the section "More Practical Use-cases: Always co-located in the same node" of the "Assigning Pods to Nodes" documentation page:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deplA
spec:
  selector:
    matchLabels:
      app: deplA
  replicas: N  # <-- N = number of nodes in the cluster (assumed <= replicas of deplB)
  template:
    metadata:
      labels:
        app: deplA
    spec:
      affinity:
        podAntiAffinity: # prevents scheduling more than one Pod A on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - deplA
            topologyKey: "kubernetes.io/hostname"
        podAffinity: # ensures that Pod A is scheduled only if Pod B is present on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - deplB
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine
There is one problem, the same in both cases: if Pod B is rescheduled to a different node for any reason and no Pod B is left on the node, Pod A will not be evicted from that node automatically.
That problem could be solved by scheduling a CronJob that uses a kubectl image and a suitable ServiceAccount and, every ~5 minutes, kills all Pods A that no longer have a corresponding Pod B on the same node. (Please search for an existing solution on Stack Overflow or ask a separate question about the script content.)
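A rough sketch of such a CronJob is shown below. The label selectors (app=deplA for Pod A, app=deplB for Pod B), the bitnami/kubectl image and the pod-cleaner ServiceAccount (which needs RBAC permission to list and delete Pods) are assumptions for illustration, not part of a tested solution:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-a-watchdog
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleaner   # assumed; needs list/delete on pods
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            - |
              # nodes that currently run at least one Pod B
              good_nodes=$(kubectl get pods -l app=deplB -o jsonpath='{.items[*].spec.nodeName}')
              # delete every running Pod A whose node is not in that list
              kubectl get pods -l app=deplA --field-selector=status.phase=Running \
                -o jsonpath='{range .items[*]}{.metadata.name} {.spec.nodeName}{"\n"}{end}' |
              while read -r pod node; do
                case " $good_nodes " in
                  *" $node "*) ;;                      # a Pod B is present, keep Pod A
                  *) kubectl delete pod "$pod" ;;      # no Pod B on this node, remove Pod A
                esac
              done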
Upvotes: 0
Reputation: 3613
As already explained by coderanger, the current scheduler doesn't support this function. The ideal solution would be to create your own scheduler that supports such functionality.
However, you can use podAffinity to partially achieve scheduling the Pods on the same node:
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - <your_value>
          topologyKey: "kubernetes.io/hostname"
It will try to schedule the Pods as tightly as possible, but since this is only a preference it is not guaranteed.
Upvotes: 0
Reputation: 54191
The current scheduler doesn’t really have anything like this. You would need to write something yourself.
Upvotes: 1