Reputation: 9825
I know there are some existing questions out there; they usually refer to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why
But I'm still having trouble debugging. I only have 1 pod running on my cluster, so I don't see why it wouldn't scale down to 1 node. How can I debug this further?
Here's some info:
kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
gke-qua-gke-foobar1234-default-pool-6302174e-4k84   Ready    <none>   4h14m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-6wfs   Ready    <none>   16d     v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-74lm   Ready    <none>   4h13m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-m223   Ready    <none>   4h13m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-srlg   Ready    <none>   66d     v1.14.10-gke.27
kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
qua-gke-foobar1234-5959446675-njzh4   1/1     Running   0          14m
nodePools:
- autoscaling:
    enabled: true
    maxNodeCount: 10
    minNodeCount: 1
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-highcpu-32
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/datastore
    - https://www.googleapis.com/auth/devstorage.full_control
    - https://www.googleapis.com/auth/pubsub
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
  initialNodeCount: 1
  instanceGroupUrls:
  - https://www.googleapis.com/compute/v1/projects/fooooobbbarrr-dev/zones/us-central1-a/instanceGroupManagers/gke-qua-gke-foobar1234-default-pool-6302174e-grp
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  selfLink: https://container.googleapis.com/v1/projects/ffoooobarrrr-dev/locations/us-central1/clusters/qua-gke-foobar1234/nodePools/default-pool
  status: RUNNING
  version: 1.14.10-gke.27
kubectl describe horizontalpodautoscaler
Name:                qua-gke-foobar1234
Namespace:           default
Labels:              <none>
Annotations:         autoscaling.alpha.kubernetes.io/conditions:
                       [{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-03-17T19:59:19Z","reason":"ReadyForNewScale","message":"recommended size...
                     autoscaling.alpha.kubernetes.io/current-metrics:
                       [{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
                     autoscaling.alpha.kubernetes.io/metrics:
                       [{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
                     kubectl.kubernetes.io/last-applied-configuration:
                       {"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"qua-gke-foobar1234","namespace":...
CreationTimestamp:   Tue, 17 Mar 2020 12:59:03 -0700
Reference:           Deployment/qua-gke-foobar1234
Min replicas:        1
Max replicas:        10
Deployment pods:     1 current / 1 desired
Events:              <none>
Upvotes: 1
Views: 1391
Reputation: 93
I had the same issue, and the cause was the lack of PodDisruptionBudgets (PDBs) for the workloads running in the kube-system namespace. You can check the "Autoscaler Logs" tab.
If you don't configure PDBs for those system pods, the cluster autoscaler won't remove surplus GKE nodes: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
There is an interesting discussion about whether there should be some default PDB behaviour: https://github.com/kubernetes/kubernetes/issues/35318
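For reference, a minimal PDB for one of those system workloads looks roughly like this. This is only a sketch: the kube-dns selector is an assumption, so check the actual labels in your cluster, and policy/v1beta1 matches the 1.14 cluster in the question (newer clusters use policy/v1).

# Hypothetical example: lets the cluster autoscaler evict kube-dns pods one at a time when draining a node
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: pdb-dns
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns   # assumed label; verify with: kubectl get pods -n kube-system --show-labels

With maxUnavailable: 1, the autoscaler is allowed to evict one replica at a time, which is usually enough for it to drain and remove an underutilized node.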
Upvotes: 1
Reputation: 9825
So the original problem with my debugging attempt was that I ran kubectl get pods and not kubectl get pods --all-namespaces, so I couldn't see the pods running in the kube-system namespace. Then I added PDBs for all the system pods:
kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-fluentd-scaler --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1
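To confirm the budgets were created (and to see how many disruptions each one currently allows), you can list them with:
kubectl get poddisruptionbudget --namespace=kube-system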
I then started to see errors in the events of some of the PDBs: controllermanager Failed to calculate the number of expected pods: found no controllers for pod. I saw these in the PDB events when I ran kubectl describe pdb --all-namespaces. I don't know why they were occurring, but I removed those PDBs, and then everything started working!
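For anyone repeating this, the clean-up amounts to something like the following (pdb-glbc is just an example name from the list above; delete whichever budgets show the "found no controllers" error in their events):
kubectl describe pdb --all-namespaces                                  # check the Events section of each budget
kubectl delete poddisruptionbudget pdb-glbc --namespace=kube-system    # remove the ones reporting the error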
Upvotes: 1
Reputation: 7725
The HorizontalPodAutoscaler increases or decreases the number of pods, not nodes; it has nothing to do with node scaling.
Node scaling is handled by the cloud provider, in your case Google Cloud Platform.
You should check whether the node autoscaler is enabled in the GCP console. Follow these steps:
1. Go to the Kubernetes clusters screen in the GCP console.
2. Click on your cluster.
3. At the bottom, click on the node pool you want to enable autoscaling for.
4. Click "edit".
5. Enable autoscaling, define the minimum and maximum number of nodes, and save.
Alternatively, you can enable it via the gcloud CLI, as described here:
gcloud container clusters update cluster-name --enable-autoscaling \
--min-nodes 1 --max-nodes 10 --zone compute-zone --node-pool default-pool
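To check afterwards whether autoscaling is actually enabled on the pool, something like this should work (the cluster name and region below are taken from the selfLink in the question; adjust them for your own setup):
gcloud container node-pools describe default-pool \
    --cluster qua-gke-foobar1234 --region us-central1 \
    --format="value(autoscaling.enabled, autoscaling.minNodeCount, autoscaling.maxNodeCount)"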
Upvotes: 2