Kaushik Ghosh

Reputation: 131

Selecting a Dataproc Cluster Size with autoscaling ON

I am new to GCP and have a probably very basic question. We run our PySpark jobs on an ephemeral Dataproc cluster with autoscaling enabled. In our code we use repartition to set the number of partitions based on the number of cores (either equal to or a multiple of the total number of cores in the cluster) to achieve maximum task parallelism. My question: with autoscaling on, workers can be added or removed at runtime based on the YARN pending-resource metrics (pending memory or pending cores), so the total number of cores/executors available in the cluster changes over time. In that case the repartition value hardcoded in my code may no longer be appropriate.

What is the best approach to handle this scenario, i.e., how do I determine the number of partitions on a cluster whose worker count keeps changing?
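One common pattern (a sketch, not an official Dataproc recommendation) is to compute the partition count at runtime from whatever the cluster currently reports, instead of hardcoding it. In PySpark, `spark.sparkContext.defaultParallelism` reflects the total cores of the executors registered at that moment, so querying it just before the repartition adapts to the autoscaled size. The `factor` multiplier below is an assumed tuning knob, not a Spark setting:

```python
def target_partitions(total_cores: int, factor: int = 2) -> int:
    """Return a partition count that is a multiple of the cores
    currently available, with a floor of 1 so an empty/booting
    cluster never produces a zero partition count."""
    return max(total_cores * factor, 1)

# In a PySpark job you would query the live value just before repartitioning,
# e.g. (defaultParallelism only counts executors registered at that instant):
#   n = target_partitions(spark.sparkContext.defaultParallelism)
#   df = df.repartition(n)

print(target_partitions(16))  # 32
print(target_partitions(0))   # 1
```

On Spark 3.x, another option is to enable adaptive query execution (`spark.sql.adaptive.enabled=true`), which coalesces shuffle partitions at runtime based on actual data sizes, reducing the need to pick an exact repartition count up front.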

Appreciate some help in this regard.

Upvotes: 1

Views: 62

Answers (0)

Related Questions