user13316830

GCP Instance Group auto-scaling works unexpectedly

I have an autoscaling instance group on GCP with 1 to 15 instances. The scale-out rule is a CPU load above 50%. No scale-in controls are enabled. Yet all 15 instances are running constantly. Monitoring shows that over the past 12 hours 5 instances had more than 80% CPU load (up to 300%, which I can't explain) and 10 instances had less than 5% CPU load. Why are these 10 instances running if they are not in use? I would expect at most a few spare instances while the others are fully loaded, not twice as many spare instances as busy ones. Why does it work this way, and how can I make it work as expected?

The group auto-scaling options are as follows:

autoscaler:
  autoscalingPolicy:
    coolDownPeriodSec: 180
    cpuUtilization:
      utilizationTarget: 0.5
    maxNumReplicas: 15
    minNumReplicas: 1
    mode: ON
    scaleInControl:
      maxScaledInReplicas:
        fixed: 1
      timeWindowSec: 60
  creationTimestamp: '2020-12-04T01:46:57.815-08:00'
  id: '***'
  kind: compute#autoscaler
  name: ***
  recommendedSize: 10
  selfLink: https://www.googleapis.com/compute/v1/projects/***/zones/europe-west4-a/autoscalers/***
  status: ACTIVE
  target: https://www.googleapis.com/compute/v1/projects/***/zones/europe-west4-a/instanceGroupManagers/***
  zone: https://www.googleapis.com/compute/v1/projects/***/zones/europe-west4-a

Upvotes: 0

Views: 1213

Answers (2)

Piotr Tabor

Reputation: 36

What matters is how you distribute work between your instances.

The autoscaler provisions enough instances to meet the target average utilization across the group. If some instances in your group are highly utilized while others sit idle, the problem might be in the logic that distributes work between instances. Horizontal scaling works well only if newly created instances can take over part of the work from the highly utilized ones.

Some things to check:

  1. Do you use a load balancer? Are the target VMs healthy according to the health check?
  2. Does your application have a uniform load, or do some nodes (e.g. leaders) do additional work?
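
To illustrate why an uneven distribution matters, here is a minimal sketch of sizing a group by average utilization. It is a simplified model, not the actual autoscaler algorithm: the function `recommended_size` and the per-instance figures (5 instances at 90%, 10 at 5%) are assumptions chosen to resemble the numbers in the question.

```python
def recommended_size(utilizations_pct, target_pct, min_replicas, max_replicas):
    """Simplified model (an assumption, not GCP's real algorithm):
    pick the smallest group size at which the same total load,
    spread evenly, would sit at or below the target utilization."""
    total = sum(utilizations_pct)
    needed = -(-total // target_pct)  # ceiling division on integers
    # Clamp to the configured minNumReplicas / maxNumReplicas bounds.
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical per-instance CPU utilization in percent:
# 5 busy instances and 10 nearly idle ones.
skewed = [90] * 5 + [5] * 10

# The same total load, spread evenly over 10 instances.
even = [50] * 10

print(recommended_size(skewed, 50, 1, 15))  # 10
print(recommended_size(even, 50, 1, 15))    # 10
```

In this model the result depends only on the total load, not on how it is spread, so adding instances does nothing for the overloaded ones unless the work is actually redistributed to the new instances.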

Upvotes: 1

Mahboob

Reputation: 1975

A feature called predictive autoscaling (PAS) may be the reason the MIG scales up on the CPU metric, though I'm not sure whether it is enabled for your group. PAS learns that a load peak occurs daily (or on a specific day of the week) at a certain time and scales up in advance based on a predictive algorithm, even if today is unusual and the peak doesn't happen.

To make sure the MIG scales only on the raw metric, run the following command for your MIG, replacing GROUP with its name (e.g. "your-mig"):

gcloud alpha compute instance-groups managed update-autoscaling GROUP \
--cpu-utilization-predictive-method none

Upvotes: 0
