Gleeb

Reputation: 11289

Kubernetes pod distribution amongst nodes

Is there any way to make Kubernetes distribute pods across nodes as evenly as possible? I have resource requests set on all deployments, global requests, and an HPA. All nodes are identical.

I just had a situation where my ASG scaled down a node and one service became completely unavailable, because all 4 of its pods were on the node that was removed.

I would like to ensure that each deployment spreads its pods across at least 2 nodes.

Upvotes: 33

Views: 21621

Answers (3)

Rotem jackoby

Reputation: 22068

Instead of podAntiAffinity, consider using Pod Topology Spread Constraints:

You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.

Example:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
  labels:
    foo: bar
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.1
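
Applied to the question's scenario (spreading a Deployment's pods across nodes), the same constraint goes into the Deployment's pod template with the built-in kubernetes.io/hostname label as the topology key. A minimal sketch, assuming pods labeled app: myapp (the name and label are illustrative, not from the question):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                            # pod count per node may differ by at most 1
        topologyKey: kubernetes.io/hostname   # treat each node as its own topology domain
        whenUnsatisfiable: DoNotSchedule      # hard rule; use ScheduleAnyway for a soft preference
        labelSelector:
          matchLabels:
            app: myapp
      containers:
      - name: myapp
        image: registry.k8s.io/pause:3.1

With DoNotSchedule and maxSkew: 1, replicas are spread evenly across the available nodes, so with more than one schedulable node they cannot all end up on the single node an ASG later removes.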

New in K8s 1.27: more fine-grained pod topology spread policies reached beta.
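
One of the fields involved is matchLabelKeys, which restricts the spread calculation to pods that share the same values for the listed label keys (commonly pod-template-hash, so each rollout revision is spread independently). A hedged sketch of such a constraint, assuming a cluster where this beta field is available:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      foo: bar
  matchLabelKeys:
  - pod-template-hash   # only compare pods from the same ReplicaSet revision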

Upvotes: 0

Maxim Yefremov

Reputation: 14165

Here I build on Anirudh's answer by adding example code.

My initial Kubernetes YAML looked like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: say-deployment
spec:
  replicas: 6
  selector:
    matchLabels:
      app: say
  template:
    metadata:
      labels:
        app: say
    spec:
      containers:
      - name: say
        image: gcr.io/hazel-champion-200108/say
        ports:
        - containerPort: 8080
---
kind: Service
apiVersion: v1
metadata:
  name: say-service
spec:
  selector:
    app: say
  ports:
    - protocol: TCP
      port: 8080
  type: LoadBalancer
  externalIPs:
    - 192.168.0.112

At this point, the Kubernetes scheduler somehow decides that all 6 replicas should be deployed on the same node.

Then I added a requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity rule to force the pods to be deployed on different nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: say-deployment
spec:
  replicas: 6
  selector:
    matchLabels:
      app: say
  template:
    metadata:
      labels:
        app: say
    spec:
      containers:
      - name: say
        image: gcr.io/hazel-champion-200108/say
        ports:
        - containerPort: 8080
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - say
            topologyKey: "kubernetes.io/hostname"
---
kind: Service
apiVersion: v1
metadata:
  name: say-service
spec:
  selector:
    app: say
  ports:
    - protocol: TCP
      port: 8080
  type: LoadBalancer
  externalIPs:
    - 192.168.0.112

Now all the running pods are on different nodes. And since I have 3 nodes and 6 pods, the other 3 pods (6 minus 3) cannot run and stay Pending. This is because the rule is a hard requirement: requiredDuringSchedulingIgnoredDuringExecution.

kubectl get pods -o wide 

NAME                             READY     STATUS    RESTARTS   AGE       IP            NODE
say-deployment-8b46845d8-4zdw2   1/1       Running   0          24s       10.244.2.80   night
say-deployment-8b46845d8-699wg   0/1       Pending   0          24s       <none>        <none>
say-deployment-8b46845d8-7nvqp   1/1       Running   0          24s       10.244.1.72   gray
say-deployment-8b46845d8-bzw48   1/1       Running   0          24s       10.244.0.25   np3
say-deployment-8b46845d8-vwn8g   0/1       Pending   0          24s       <none>        <none>
say-deployment-8b46845d8-ws8lr   0/1       Pending   0          24s       <none>        <none>

Now if I loosen this requirement with preferredDuringSchedulingIgnoredDuringExecution:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: say-deployment
spec:
  replicas: 6
  selector:
    matchLabels:
      app: say
  template:
    metadata:
      labels:
        app: say
    spec:
      containers:
      - name: say
        image: gcr.io/hazel-champion-200108/say
        ports:
        - containerPort: 8080
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: "app"
                  operator: In
                  values:
                  - say
              topologyKey: "kubernetes.io/hostname"
---
kind: Service
apiVersion: v1
metadata:
  name: say-service
spec:
  selector:
    app: say
  ports:
    - protocol: TCP
      port: 8080
  type: LoadBalancer
  externalIPs:
    - 192.168.0.112

The first 3 pods are deployed on 3 different nodes, just like in the previous case. The remaining 3 (6 pods minus 3 nodes) are placed on various nodes according to the scheduler's internal considerations.

NAME                              READY     STATUS    RESTARTS   AGE       IP            NODE
say-deployment-57cf5fb49b-26nvl   1/1       Running   0          59s       10.244.2.81   night
say-deployment-57cf5fb49b-2wnsc   1/1       Running   0          59s       10.244.0.27   np3
say-deployment-57cf5fb49b-6v24l   1/1       Running   0          59s       10.244.1.73   gray
say-deployment-57cf5fb49b-cxkbz   1/1       Running   0          59s       10.244.0.26   np3
say-deployment-57cf5fb49b-dxpcf   1/1       Running   0          59s       10.244.1.75   gray
say-deployment-57cf5fb49b-vv98p   1/1       Running   0          59s       10.244.1.74   gray

Upvotes: 39

Anirudh Ramanathan

Reputation: 46728

Sounds like what you want is Inter-Pod Affinity and Pod Anti-affinity.

Inter-pod affinity and anti-affinity were introduced in Kubernetes 1.4. Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to schedule on based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y.” Y is expressed as a LabelSelector with an associated list of namespaces (or “all” namespaces); unlike nodes, because pods are namespaced (and therefore the labels on pods are implicitly namespaced), a label selector over pod labels must specify which namespaces the selector should apply to. Conceptually X is a topology domain like node, rack, cloud provider zone, cloud provider region, etc. You express it using a topologyKey which is the key for the node label that the system uses to denote such a topology domain, e.g. see the label keys listed above in the section “Interlude: built-in node labels.”

Anti-affinity can be used to ensure that you are spreading your pods across failure domains. You can state these rules as preferences or as hard rules. In the latter case, if the scheduler is unable to satisfy your constraint, the pod will fail to get scheduled.
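
For example, a minimal anti-affinity stanza on the pod template might look like the sketch below (the app: myapp label is illustrative; Maxim Yefremov's answer above shows the same idea in a complete Deployment):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # soft preference; use requiredDuringScheduling... for a hard rule
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname   # one topology domain per node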

Upvotes: 13
