Reputation: 2083
I was going through the documentation on autoscaling a Dataproc cluster. The doc says autoscaling needs a YAML policy file before you create the cluster, with a minimal configuration as given below:
workerConfig:
  minInstances: 2
  maxInstances: 100
  weight: 1
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
  weight: 1
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    scaleUpMinWorkerFraction: 0.0
    scaleDownMinWorkerFraction: 0.0
    gracefulDecommissionTimeout: 1h
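For context, a policy file like this is first imported as an autoscaling policy and then referenced when the cluster is created, roughly like this (the policy id, file name, region, and cluster name below are placeholders, not values from the documentation):

gcloud dataproc autoscaling-policies import my-autoscaling-policy \
    --source=autoscaling-policy.yaml \
    --region=us-central1
gcloud dataproc clusters create dev-spark \
    --autoscaling-policy=my-autoscaling-policy \
    --region=us-central1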
Further down in the same documentation, I saw this:
Avoid scaling primary workers: Primary workers run HDFS Datanodes, while secondary workers are compute-only workers. Avoid scaling primary workers to avoid running into these issues. For example:
workerConfig:
  minInstances: 10
  maxInstances: 10
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 100
I created a cluster on Dataproc that has the following VM instances:
name           role
dev-spark-m    master
dev-spark-w-0  worker
dev-spark-w-1  worker
When the documentation says "Avoid scaling primary workers", should I understand that the workerConfig key in the YAML file corresponds to autoscaling the master dev-spark-m, and that I should avoid doing that? In that case, can I simply keep the secondaryWorkerConfig section and leave workerConfig out entirely?
Upvotes: 1
Views: 999
Reputation: 608
The workerConfig field is always about the primary workers, not the master. Setting its minInstances and maxInstances fields to the same value is equivalent to "no scaling", which is what the documentation's second example does. You cannot simply omit workerConfig: the policy still requires a workerConfig.maxInstances value, and the workerConfig.minInstances value defaults to 2.
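A minimal sketch of a policy along those lines, where the primary worker count is pinned and only secondary workers scale (the specific limits, factors, and timings here are placeholder values, not recommendations):

workerConfig:
  minInstances: 2   # equal min and max => primary workers are never scaled
  maxInstances: 2
  weight: 1
secondaryWorkerConfig:
  minInstances: 0   # only secondary (compute-only) workers scale
  maxInstances: 100
  weight: 1
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    scaleUpMinWorkerFraction: 0.0
    scaleDownMinWorkerFraction: 0.0
    gracefulDecommissionTimeout: 1h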
Upvotes: 1