Felipe Colussi-oliva

Reputation: 188

K8s StatefulSets "pending" after node scale

First of all: I have read other posts like this one.

My staging cluster is allocated on AWS using spot instances.

I have 50+ pods (running different services/products) and 6 StatefulSets.

I created the StatefulSets this way:

Note: I did not create the PVs and PVCs manually; they are created from the StatefulSet.

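apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  labels:
    app: redis
spec:
  selector:
    matchLabels:
      app: redis
  serviceName: "redis"
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:alpine
        imagePullPolicy: Always
        ports:
        - containerPort: 6379
          name: client
        volumeMounts:
          - name: data
            mountPath: /data
            readOnly: false
  volumeClaimTemplates:
    - metadata:
        name: data
        labels:  # parent key was lost in the original paste; "labels" is assumed
          name: redis-gp2
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  ports:
  - port: 6379
    name: redis
    targetPort: 6379
  selector:
    app: redis
  type: NodePort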

I do have node and pod autoscalers configured.

Last week, after I deployed some extra microservices during the "usage peak", the node autoscaler triggered.

During the scale-down, some pods (belonging to StatefulSets) failed with the error "node(s) had volume node affinity conflict".

My first reaction was to delete and "recreate" the PVs/PVCs with high priority. That "fixed" the pending pods at the time.

Today I forced another scale-up, so I was able to check what was happening.

The problem occurs during the scale-up and takes a long time to return to normal (about 30 min), even after scaling down.

Describe Pod:

Name:                 redis-0
Namespace:            ***-staging
Priority:             1000
Priority Class Name:  prioridade-muito-alta
Node:                 ip-***-***-***-***.sa-east-1.compute.internal/***.***.*.***
Start Time:           Mon, 03 Jan 2022 09:24:13 -0300
Labels:               app=redis
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   ***.***.***.***
IPs:
  IP:           ***.***.***.***
Controlled By:  StatefulSet/redis
Containers:
  redis:
    Container ID:   docker://4928f38ed12c206dc5915c863415d3eba98b9592f2ab5c332a900aa2fa2cef64
    Image:          redis:alpine
    Image ID:       docker-pullable://redis@sha256:4bed291aa5efb9f0d77b76ff7d4ab71eee410962965d052552db1fb80576431d
    Port:           6379/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 03 Jan 2022 09:24:36 -0300
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /data from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ngc7q (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-redis-0
    ReadOnly:   false
  default-token-ngc7q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  *****
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                  From                                                  Message
  ----     ------                  ----                 ----                                                  -------
  Warning  FailedScheduling        59m (x4 over 61m)    default-scheduler                                     0/7 nodes are available: 1 Too many pods, 1 node(s) were unschedulable, 5 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        58m                  default-scheduler                                     0/7 nodes are available: 1 Too many pods, 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 4 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        58m                  default-scheduler                                     0/7 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1641210902}, that the pod didn't tolerate, 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        57m (x2 over 58m)    default-scheduler                                     0/7 nodes are available: 2 Too many pods, 2 node(s) were unschedulable, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        50m (x9 over 57m)    default-scheduler                                     0/6 nodes are available: 1 node(s) were unschedulable, 2 Too many pods, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        48m (x2 over 49m)    default-scheduler                                     0/5 nodes are available: 2 Too many pods, 3 node(s) had volume node affinity conflict.
  Warning  FailedScheduling        35m (x10 over 48m)   default-scheduler                                     0/5 nodes are available: 1 Too many pods, 4 node(s) had volume node affinity conflict.
  Normal   NotTriggerScaleUp       30m (x163 over 58m)  cluster-autoscaler                                    pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) had volume node affinity conflict
  Warning  FailedScheduling        30m (x3 over 33m)    default-scheduler                                     0/5 nodes are available: 5 node(s) had volume node affinity conflict.
  Normal   SuccessfulAttachVolume  29m                  attachdetach-controller                               AttachVolume.Attach succeeded for volume "pvc-23168a78-2286-40b7-aa71-194ca58e0005"
  Normal   Pulling                 28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Pulling image "redis:alpine"
  Normal   Pulled                  28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Successfully pulled image "redis:alpine" in 3.843908086s
  Normal   Created                 28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Created container redis
  Normal   Started                 28m                  kubelet, ip-***-***-***-***.sa-east-1.compute.internal  Started container redis


Name:          data-redis-0
Namespace:     ***-staging
StorageClass:  gp2
Status:        Bound
Volume:        pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels:        app=redis
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
               volume.kubernetes.io/selected-node: ip-***-***-***-***.sa-east-1.compute.internal
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    redis-0
Events:        <none>


Name:              pvc-23168a78-2286-40b7-aa71-194ca58e0005
Labels:            failure-domain.beta.kubernetes.io/region=sa-east-1
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp2
Status:            Bound
Claim:             ***-staging/data-redis-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [sa-east-1b]
                   failure-domain.beta.kubernetes.io/region in [sa-east-1]
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://sa-east-1b/vol-061fd23a65185d42c
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

This happened in 4 of my 6 StatefulSets.


If I create the PVs and PVCs manually, setting:

volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - sa-east-1

will scaling up/down stop breaking the StatefulSets?

If not, what can I do to avoid this problem?

Upvotes: 2

Views: 753

Answers (2)

Piotr Malec

Reputation: 3657

You can also avoid this problem by separating your Kubernetes workloads with node pool segregation and affinity options, as mentioned in this external article.

If only a portion of your workload requires PVs/PVCs, I would suggest using a dedicated node pool for your StatefulSets.
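As a rough sketch of what pinning a StatefulSet to a dedicated node pool could look like, the fragment below goes under spec.template.spec of the StatefulSet. The label and taint "workload=stateful" are hypothetical names chosen for illustration; they assume the dedicated nodes were created with that label and a matching NoSchedule taint.

```yaml
# Fragment of spec.template.spec in the StatefulSet.
# "workload: stateful" is a hypothetical node label/taint, not from the question.
nodeSelector:
  workload: stateful          # schedule only onto nodes carrying this label
tolerations:
- key: workload               # tolerate the taint that keeps other pods off the pool
  operator: Equal
  value: stateful
  effect: NoSchedule
```

If that node pool lives in a single availability zone, the pods and their EBS volumes stay in the same zone regardless of how the rest of the cluster scales.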

Upvotes: 1

Vasilii Angapov

Reputation: 9032

First of all, it's better to move the allowedTopologies stanza to the StorageClass. It's more flexible because you can create multiple zone-specific storage classes.

And yes, this should solve your current problem but create another one: you are basically sacrificing high availability for cost/convenience. It's totally up to you; there is no one-size-fits-all recommendation here, but I just want to make sure you know the options.

You can still have volumes that are not tied to a specific zone if you always have enough node capacity in every AZ. This can be achieved with cluster-autoscaler: generally, you create a separate node group for each AZ and the autoscaler will do the rest.

Another option is to build distributed storage with something like Ceph or Portworx, which allows mounting volumes from another AZ. That will greatly increase your cross-AZ traffic costs and needs to be maintained properly, but I know companies that do that.
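A minimal sketch of such a zone-specific StorageClass follows. The provisioner, zone label key, and zone sa-east-1b are taken from the PV/PVC output in the question; the class name gp2-sa-east-1b is a hypothetical choice.

```yaml
# Zone-pinned StorageClass (sketch; the name is hypothetical).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-sa-east-1b
provisioner: kubernetes.io/aws-ebs      # in-tree EBS provisioner, as shown on the PVC
parameters:
  type: gp2
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer # provision only once a pod is scheduled
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - sa-east-1b                        # must be a zone, not the region
```

The StatefulSet would then reference it by adding storageClassName: gp2-sa-east-1b under volumeClaimTemplates[].spec. Existing PVCs keep their current class, so this only affects newly created claims.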
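One way to express per-AZ node groups, assuming the cluster is managed with eksctl (the question only shows that it runs on EKS), is sketched below. The cluster name, node group names, instance type, sizes, and the zone sa-east-1a are hypothetical; only sa-east-1b appears in the question.

```yaml
# eksctl ClusterConfig sketch: one node group per availability zone.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: staging                 # hypothetical cluster name
  region: sa-east-1
nodeGroups:
- name: workers-1a
  availabilityZones: ["sa-east-1a"]
  instanceType: m5.large        # hypothetical
  minSize: 1                    # keep at least one node in this zone
  maxSize: 5
  iam:
    withAddonPolicies:
      autoScaler: true          # IAM permissions for cluster-autoscaler
- name: workers-1b
  availabilityZones: ["sa-east-1b"]
  instanceType: m5.large
  minSize: 1
  maxSize: 5
  iam:
    withAddonPolicies:
      autoScaler: true
```

With one group per zone, cluster-autoscaler can grow the group in the zone where a pending pod's volume lives, instead of adding a node in a zone that cannot attach it.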

Upvotes: 2
