Fisher Coder

Reputation: 3576

How can we optimize CPU/core/executor for different stages of a Spark job?

As the Spark UI screenshot below shows:

[Spark UI screenshot of the job's stages]

My Spark job has three stages:

0. groupBy
1. repartition
2. collect

Stages 0 and 1 are pretty lightweight; however, stage 2 is quite CPU-intensive.
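For concreteness, a minimal sketch of a job with this shape (the input path, column name, and partition count are just placeholders, not our actual code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("three-stage-job").getOrCreate()
    import spark.implicits._

    // Hypothetical input; stands in for our real data source.
    val events = spark.read.parquet("/data/events")

    val aggregated = events
      .groupBy($"key")        // stage 0: shuffle for the aggregation
      .count()
      .repartition(200)       // stage 1: extra shuffle to change the partitioning

    // stage 2: final tasks run and the rows are gathered on the driver
    val result = aggregated.collect()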

Is it possible to have different configuration for different stages of one Spark job?

I thought about splitting this Spark job into two sub-jobs, but that defeats the purpose of using Spark, which keeps all intermediate results in memory. It would also significantly extend our job time.

Any ideas please?

Upvotes: 1

Views: 93

Answers (2)

human

Reputation: 2449

I agree with Shaido's point, but I want to add that Spark 2.x comes with something known as dynamic resource allocation.

https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

At a high level, Spark should relinquish executors when they are no longer used and acquire executors when they are needed.

This means the application can adjust the number of executors dynamically instead of fixing it through spark.executor.instances.

spark.executor.instances is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.
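For example, you could enable it when building the session with something like the following (the min/max executor counts are placeholders; on Spark 2.x dynamic allocation also requires the external shuffle service to be running on the workers):

    import org.apache.spark.sql.SparkSession

    // Sketch only: enable dynamic allocation instead of a fixed spark.executor.instances.
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-example")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")  // external shuffle service must be set up on the cluster
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .getOrCreate()

Spark will then request more executors when stages queue up tasks and release idle executors afterwards, so the CPU-heavy stage can use more of the cluster than the lightweight ones.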

Upvotes: 0

Shaido

Reputation: 28422

No, it's not possible to change the spark configurations at runtime. See the documentation for SparkConf:

Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.


However, I would guess you do not need to do a repartition before the collect if there are no other operations in between. repartition will move the data around on the nodes, which is unnecessary if all you want to do is collect them onto the driver node.
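In other words, something like this should be enough (a sketch, reusing the placeholder names from the question):

    // The groupBy already shuffles the data; collect() then pulls every row to
    // the driver regardless of how the result is partitioned, so an extra
    // repartition in between only adds another shuffle.
    val result = events
      .groupBy($"key")
      .count()
      .collect()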

Upvotes: 1
