Metadata

Reputation: 2083

How to understand primary workers while autoscaling GCP Dataproc?

I was going through the documentation on autoscaling a Dataproc cluster. The doc says autoscaling needs a YAML file before you create a cluster, with a minimum configuration as given below:

workerConfig:
  minInstances: 2
  maxInstances: 100
  weight: 1
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
  weight: 1
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    scaleUpMinWorkerFraction: 0.0
    scaleDownMinWorkerFraction: 0.0
    gracefulDecommissionTimeout: 1h

Down the line in the same documentation, I saw this:

Avoid scaling primary workers: primary workers run HDFS Datanodes, while secondary workers are compute-only workers. Keeping the primary worker count fixed avoids HDFS-related issues when the cluster scales down. For example:

     workerConfig:
       minInstances: 10
       maxInstances: 10
     secondaryWorkerConfig:
       minInstances: 0
       maxInstances: 100

I created a cluster on Dataproc that has the following VM instances:

name               role
dev-spark-m      master
dev-spark-w-0    worker
dev-spark-w-1    worker

When the documentation says "Avoid scaling primary workers", should I understand that the workerConfig key in the YAML file corresponds to autoscaling the master dev-spark-m, and that I should avoid doing that? In that case, can I keep only the secondaryWorkerConfig section and leave out workerConfig entirely?

Upvotes: 1

Views: 999

Answers (1)

cyxxy

Reputation: 608

  1. The master is never scaled. The workerConfig field always refers to the primary workers, not the master.
  2. Setting minInstances and maxInstances to the same value is equivalent to "no scaling" (see the sketch below).
  3. You always have to specify at least workerConfig.maxInstances, and workerConfig.minInstances defaults to 2, so you cannot simply omit workerConfig.
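Putting these together, an autoscaling policy that keeps the primary workers fixed and scales only the secondary workers could look roughly like this (the instance counts are illustrative, and the basicAlgorithm values are simply carried over from the example in the question):

workerConfig:
  # min == max: the primary (HDFS) workers are never scaled
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  # compute-only workers absorb all scaling
  minInstances: 0
  maxInstances: 100
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h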

Upvotes: 1
