Metadata

Reputation: 2083

How to understand primary workers while autoscaling GCP Dataproc?

I was going through the documentation on autoscaling a Dataproc cluster. The doc says autoscaling needs a YAML file before you create a cluster, with a minimum configuration as given below:

workerConfig:
  minInstances: 2
  maxInstances: 100
  weight: 1
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
  weight: 1
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    scaleUpMinWorkerFraction: 0.0
    scaleDownMinWorkerFraction: 0.0
    gracefulDecommissionTimeout: 1h

Down the line in the same documentation, I saw this:

Avoid scaling primary workers: primary workers run HDFS Datanodes, while secondary workers are compute-only workers. Keeping the primary worker count fixed avoids HDFS-related issues when the cluster scales down. For example:

     workerConfig:
       minInstances: 10
       maxInstances: 10
     secondaryWorkerConfig:
       minInstances: 0
       maxInstances: 100

I created a cluster on Dataproc that has the following VM instances:

name               role
dev-spark-m      master
dev-spark-w-0    worker
dev-spark-w-1    worker

When the documentation says "Avoid scaling primary workers", should I understand that the workerConfig key in the YAML file corresponds to autoscaling the master dev-spark-m, and that I should avoid doing that? In that case, can I keep only the secondaryWorkerConfig section and leave out workerConfig entirely?

Upvotes: 1

Views: 999

Answers (1)

cyxxy

Reputation: 608

  1. The master is never scaled. The workerConfig field always refers to the primary workers, not the master.
  2. Setting minInstances and maxInstances to the same value is equivalent to "no scaling" (see the sketch below).
  3. You always have to specify at least workerConfig.maxInstances, and workerConfig.minInstances defaults to 2, so you cannot simply omit workerConfig.
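Putting these together, an autoscaling policy that keeps the primary workers fixed and scales only the secondary workers could look roughly like this (the instance counts are illustrative, and the basicAlgorithm values are simply carried over from the example in the question):

workerConfig:
  # min == max: the primary (HDFS) workers are never scaled
  minInstances: 2
  maxInstances: 2
secondaryWorkerConfig:
  # compute-only workers absorb all scaling
  minInstances: 0
  maxInstances: 100
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h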

Upvotes: 1
