Valentin Ouvrard

Reputation: 304

GKE can't scale up nodes due to a PersistentVolume

I'm running into a strange problem on my Terraform-managed GKE cluster.

I have a Deployment that requests a GCE persistent disk through a PVC. As soon as the Deployment is created, I get a "Can't scale up nodes" notification in the GCloud console. If I inspect the log, it says:

reason: {
  messageId: "no.scale.up.mig.failing.predicate"
  parameters: [
    0: ""
    1: "pod has unbound immediate PersistentVolumeClaims"

Without this Deployment, I get no scale-up error at all.

The PVC in question:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  name: nfs
  namespace: nfs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
  volumeMode: Filesystem
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  phase: Bound

My Deployment runs fine and the PV is created right away and bound to my PVC.

So I find this "Can't scale up nodes" error really strange.

(It's a single-zone cluster with a single node pool.)

Any ideas?

Thanks a lot

Upvotes: 1

Views: 2568

Answers (1)

CamiloARG

Reputation: 31

I'm having the same problem. It is odd, because when you create a PVC in GKE the PV is provisioned dynamically (and indeed it is), so you check with kubectl get pv,pvc --all-namespaces and everything looks normal. But it seems that when a Deployment that uses a PVC is created, this error appears while the PVC is still being provisioned, and the cluster records it and displays the alert, producing false-positive alerts. It looks like a timing issue.

One workaround is to change the value of the storageClassName definition. If you use standard-rwo instead of standard (both appear as defaults in the Storage Classes tab under Storage), the problem seems to disappear. The consequence is that the underlying disk type changes from Standard persistent disk to Balanced persistent disk; the latter actually performs better anyway. The change to the PVC from the question would look like the sketch below.
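This is a minimal sketch of the adjusted PVC, assuming everything else from the question's manifest stays the same; to my understanding, standard-rwo is backed by the Compute Engine CSI driver and uses WaitForFirstConsumer binding:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs
  namespace: nfs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # only this line changes: standard-rwo delays volume binding until the
  # Pod using the claim is scheduled, so the autoscaler no longer trips
  # over an unbound immediate PVC
  storageClassName: standard-rwo
  volumeMode: Filesystem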

EDIT: It is about Storage Classes. The volumeBindingMode of the default standard class is Immediate. According to the documentation:

The Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created. For storage backends that are topology-constrained and not globally accessible from all Nodes in the cluster, PersistentVolumes will be bound or provisioned without knowledge of the Pod's scheduling requirements. This may result in unschedulable Pods.

A cluster administrator can address this issue by specifying the WaitForFirstConsumer mode which will delay the binding and provisioning of a PersistentVolume until a Pod using the PersistentVolumeClaim is created. PersistentVolumes will be selected or provisioned conforming to the topology that is specified by the Pod's scheduling constraints. These include, but are not limited to, resource requirements, node selectors, pod affinity and anti-affinity, and taints and tolerations.
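To check which binding mode each class on your cluster uses, something like this should work (the exact classes and values will vary per cluster):

kubectl get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,BINDING:.volumeBindingMode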

So, if all the other properties of the standard StorageClass need to be kept, another solution is to create a new StorageClass:

  1. Download the YAML of the standard StorageClass
  2. Change the name definition
  3. Change volumeBindingMode: Immediate to volumeBindingMode: WaitForFirstConsumer
  4. Apply it (kubectl apply -f <file path>)
  5. In the PVC's storageClassName definition, use the name from step 2 (see the sketch below)
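A minimal sketch of what the resulting StorageClass could look like; the name standard-delayed is made up for this example, and the provisioner and parameters are assumptions based on GKE's default standard class, so copy the real values from the YAML downloaded in step 1:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-delayed            # step 2: example name, pick your own
provisioner: kubernetes.io/gce-pd   # assumed: same provisioner as the default standard class
parameters:
  type: pd-standard                 # assumed: keeps the Standard persistent disk type
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer  # step 3: delay binding until a Pod is scheduled

Then set storageClassName: standard-delayed in the PVC (step 5).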

Upvotes: 3
