Fisher Coder

Reputation: 3576

How can we optimize CPU/core/executor for different stages of a Spark job?

As the Spark UI screenshot below shows:

[Spark UI screenshot of the job's stages]

My Spark job has three stages:

0. groupBy
1. repartition
2. collect

Stages 0 and 1 are pretty lightweight; however, stage 2 is quite CPU-intensive.
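For concreteness, a minimal sketch of a job with this shape (the input path, column name, and partition count are just placeholders, not our actual code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("three-stage-job").getOrCreate()
    import spark.implicits._

    // Hypothetical input; stands in for our real data source.
    val events = spark.read.parquet("/data/events")

    val aggregated = events
      .groupBy($"key")        // stage 0: shuffle for the aggregation
      .count()
      .repartition(200)       // stage 1: extra shuffle to change the partitioning

    // stage 2: final tasks run and the rows are gathered on the driver
    val result = aggregated.collect()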

Is it possible to have different configuration for different stages of one Spark job?

I thought about splitting this Spark job into two sub-jobs, but that defeats the purpose of using Spark, which keeps all intermediate results in memory. It would also significantly extend our job time.

Any ideas please?

Upvotes: 1

Views: 93

Answers (2)

human

Reputation: 2449

I agree with Shaido's point, but I want to add that Spark 2.x comes with something known as dynamic resource allocation.

https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

At a high level, Spark should relinquish executors when they are no longer used and acquire executors when they are needed.

This means the application can adjust the number of executors dynamically instead of fixing it through spark.executor.instances.

spark.executor.instances is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.
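For example, you could enable it when building the session with something like the following (the min/max executor counts are placeholders; on Spark 2.x dynamic allocation also requires the external shuffle service to be running on the workers):

    import org.apache.spark.sql.SparkSession

    // Sketch only: enable dynamic allocation instead of a fixed spark.executor.instances.
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-example")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")  // external shuffle service must be set up on the cluster
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .getOrCreate()

Spark will then request more executors when stages queue up tasks and release idle executors afterwards, so the CPU-heavy stage can use more of the cluster than the lightweight ones.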

Upvotes: 0

Shaido

Reputation: 28422

No, it's not possible to change the spark configurations at runtime. See the documentation for SparkConf:

Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.


However, I would guess you do not need to do a repartition before the collect if there are no other operations in between. repartition will move the data around on the nodes, which is unnecessary if all you want to do is collect them onto the driver node.
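In other words, something like this should be enough (a sketch, reusing the placeholder names from the question):

    // The groupBy already shuffles the data; collect() then pulls every row to
    // the driver regardless of how the result is partitioned, so an extra
    // repartition in between only adds another shuffle.
    val result = events
      .groupBy($"key")
      .count()
      .collect()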

Upvotes: 1
