Reputation: 9825
I know there are some existing questions out there; they usually refer to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-nodes-with-low-utilization-but-they-are-not-scaled-down-why
But I'm still having trouble debugging. I only have 1 pod running on my cluster, so I don't see why it wouldn't scale down to 1 node. How can I debug this further?
Here's some info:
kubectl get nodes
NAME                                                STATUS   ROLES    AGE     VERSION
gke-qua-gke-foobar1234-default-pool-6302174e-4k84   Ready    <none>   4h14m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-6wfs   Ready    <none>   16d     v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-74lm   Ready    <none>   4h13m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-m223   Ready    <none>   4h13m   v1.14.10-gke.27
gke-qua-gke-foobar1234-default-pool-6302174e-srlg   Ready    <none>   66d     v1.14.10-gke.27
kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
qua-gke-foobar1234-5959446675-njzh4   1/1     Running   0          14m
nodePools:
- autoscaling:
    enabled: true
    maxNodeCount: 10
    minNodeCount: 1
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-highcpu-32
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/datastore
    - https://www.googleapis.com/auth/devstorage.full_control
    - https://www.googleapis.com/auth/pubsub
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
  initialNodeCount: 1
  instanceGroupUrls:
  - https://www.googleapis.com/compute/v1/projects/fooooobbbarrr-dev/zones/us-central1-a/instanceGroupManagers/gke-qua-gke-foobar1234-default-pool-6302174e-grp
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  selfLink: https://container.googleapis.com/v1/projects/ffoooobarrrr-dev/locations/us-central1/clusters/qua-gke-foobar1234/nodePools/default-pool
  status: RUNNING
  version: 1.14.10-gke.27
kubectl describe horizontalpodautoscaler
Name:                qua-gke-foobar1234
Namespace:           default
Labels:              <none>
Annotations:         autoscaling.alpha.kubernetes.io/conditions:
                       [{"type":"AbleToScale","status":"True","lastTransitionTime":"2020-03-17T19:59:19Z","reason":"ReadyForNewScale","message":"recommended size...
                     autoscaling.alpha.kubernetes.io/current-metrics:
                       [{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
                     autoscaling.alpha.kubernetes.io/metrics:
                       [{"type":"External","external":{"metricName":"pubsub.googleapis.com|subscription|num_undelivered_messages","metricSelector":{"matchLabels"...
                     kubectl.kubernetes.io/last-applied-configuration:
                       {"apiVersion":"autoscaling/v2beta1","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"qua-gke-foobar1234","namespace":...
CreationTimestamp:   Tue, 17 Mar 2020 12:59:03 -0700
Reference:           Deployment/qua-gke-foobar1234
Min replicas:        1
Max replicas:        10
Deployment pods:     1 current / 1 desired
Events:              <none>
Upvotes: 1
Views: 1391
Reputation: 93
I had the same issue, and the cause was the lack of PodDisruptionBudgets (PDBs) for the workloads running in the kube-system namespace. You can check the "Autoscaler Logs" tab.
If you don't configure PDBs for those system pods, the cluster autoscaler won't remove surplus GKE nodes: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
There is an interesting discussion about whether there should be some default PDB behaviour: https://github.com/kubernetes/kubernetes/issues/35318
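For reference, a minimal PDB for one of those system workloads looks roughly like this. This is only a sketch: the kube-dns selector is an assumption, so check the actual labels in your cluster, and policy/v1beta1 matches the 1.14 cluster in the question (newer clusters use policy/v1).

# Hypothetical example: lets the cluster autoscaler evict kube-dns pods one at a time when draining a node
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: pdb-dns
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns   # assumed label; verify with: kubectl get pods -n kube-system --show-labels

With maxUnavailable: 1, the autoscaler is allowed to evict one replica at a time, which is usually enough for it to drain and remove an underutilized node.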
Upvotes: 1
Reputation: 9825
So the original problem with my debugging attempt was that I ran kubectl get pods and not kubectl get pods --all-namespaces, so I couldn't see the pods running in the kube-system namespace. Then I added PDBs for all the system pods:
kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-fluentd-scaler --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1 &&
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1
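To confirm the budgets were created (and to see how many disruptions each one currently allows), you can list them with:
kubectl get poddisruptionbudget --namespace=kube-system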
I then started to see errors in the events of some of the PDBs: controllermanager Failed to calculate the number of expected pods: found no controllers for pod. I saw these in the PDB events when I ran kubectl describe pdb --all-namespaces. I don't know why they were occurring, but I removed those PDBs, and then everything started working!
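For anyone repeating this, the clean-up amounts to something like the following (pdb-glbc is just an example name from the list above; delete whichever budgets show the "found no controllers" error in their events):
kubectl describe pdb --all-namespaces                                  # check the Events section of each budget
kubectl delete poddisruptionbudget pdb-glbc --namespace=kube-system    # remove the ones reporting the error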
Upvotes: 1
Reputation: 7725
The HorizontalPodAutoscaler increases or decreases the number of pods, not nodes; it has nothing to do with node scaling.
Node scaling is handled by the cloud provider, in your case Google Cloud Platform.
You should check whether the node autoscaler is enabled in the GCP console. Follow these steps:
1. Go to the Kubernetes clusters screen in the GCP console.
2. Click on your cluster.
3. At the bottom, click on the node pool you want to enable autoscaling for.
4. Click "edit".
5. Enable autoscaling, define the minimum and maximum number of nodes, and save.
Alternatively, you can enable it via the gcloud CLI, as described here:
gcloud container clusters update cluster-name --enable-autoscaling \
--min-nodes 1 --max-nodes 10 --zone compute-zone --node-pool default-pool
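To check afterwards whether autoscaling is actually enabled on the pool, something like this should work (the cluster name and region below are taken from the selfLink in the question; adjust them for your own setup):
gcloud container node-pools describe default-pool \
    --cluster qua-gke-foobar1234 --region us-central1 \
    --format="value(autoscaling.enabled, autoscaling.minNodeCount, autoscaling.maxNodeCount)"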
Upvotes: 2