Reputation: 196
I've begun testing the new Autopilot mode in Google Kubernetes Engine, and I'm having a fair number of issues with the autoscaling backend.
I have a fairly standard cluster configuration with essentially all of the defaults selected. I'm attempting to deploy a number of microservices behind a single nginx ingress that manages traffic.
What I'm seeing is that the compute resources simply aren't scaling up, and nothing in the configuration suggests why.
If I review the cluster logs, I see entries that state why the scaling isn't happening, but I have no explanation for the problem they report or how to fix it.
For example:
{
  "noDecisionStatus": {
    "measureTime": "1623700519",
    "noScaleUp": {
      "unhandledPodGroupsTotalCount": 1,
      "unhandledPodGroups": [
        {
          "podGroup": {
            "totalPodCount": 1,
            "samplePod": {
              "name": "proxy-9d779889d-zt8pv",
              "namespace": "default",
              "controller": {
                "kind": "ReplicaSet",
                "apiVersion": "apps/v1",
                "name": "proxy-9d779889d"
              }
            }
          },
          "napFailureReasons": [
            {
              "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
              "parameters": [
                "northamerica-northeast1-c"
              ]
            }
          ]
        }
      ],
      ...
The key message is no.scale.up.nap.pod.zonal.resources.exceeded. This seems to suggest I'm not allowed to scale this single-zone cluster beyond its current size, but I can't find any documentation explaining the limitation. Additionally, the resources I'm currently using seem far too low to be hitting a quota.
The cluster currently reports a total CPU provision of 3.25, and memory of 12 GB.
That's using 6 deployments of 1 pod each. I can't imagine that being the limit of a single availability zone in GKE.
The logs are consistently updating, which leads me to believe GKE is trying to scale, but it keeps being declined. I need to know why and how I can fix that.
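For reference, I'm pulling these entries out of Cloud Logging with something like the following (the log name filter is my best guess at the cluster autoscaler visibility log; adjust the cluster/project scoping to taste):

```shell
# Sketch: read recent noScaleUp decisions from the GKE cluster
# autoscaler visibility logs for the current project.
gcloud logging read \
  'resource.type="k8s_cluster"
   AND logName:"container.googleapis.com%2Fcluster-autoscaler-visibility"
   AND jsonPayload.noDecisionStatus.noScaleUp:*' \
  --limit 5 \
  --format json
```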
Kubernetes version: 1.19.9-gke.1900
Upvotes: 1
Views: 1524
Reputation: 11
Using the default configuration gives you a public cluster. A public cluster assigns an external IP address to each node, which can cause your cluster to hit the regional IP address quota.
What you want is to configure the cluster as a private cluster. Private does not mean it cannot be reached from the internet at all; it just means nodes and pods are isolated from the internet by default.
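You can check whether you're actually near that quota by listing the region's quota usage (region name taken from the error in your logs; the --flatten/--format combination is one way to get a readable table):

```shell
# Sketch: show quota usage vs. limit for the region, including
# IN_USE_ADDRESSES (external IPs consumed by the cluster's nodes).
gcloud compute regions describe northamerica-northeast1 \
  --flatten="quotas" \
  --format="table(quotas.metric,quotas.usage,quotas.limit)"
```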
https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#all_access
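A minimal sketch of creating a private Autopilot cluster, assuming the region from your logs (the cluster name and control-plane CIDR are example values; see the linked docs for the access options):

```shell
# Sketch: create an Autopilot cluster whose nodes have no external IPs.
gcloud container clusters create-auto example-autopilot-cluster \
  --region northamerica-northeast1 \
  --enable-private-nodes \
  --master-ipv4-cidr 172.16.0.0/28
```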
Upvotes: 1