Reputation: 3576
As the picture below shows, my Spark job has three stages:
0. groupBy
1. repartition
2. collect
Stages 0 and 1 are pretty lightweight; stage 2, however, is quite CPU-intensive.
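Roughly, the job looks something like the following minimal sketch (the column names and sample data are placeholders, not our real job):

```scala
import org.apache.spark.sql.SparkSession

object ThreeStageJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("three-stage-job").getOrCreate()
    import spark.implicits._

    // placeholder data standing in for the real input
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    val result = df
      .groupBy("key").sum("value") // stage 0: shuffle for the aggregation
      .repartition(10)             // stage 1: second shuffle to change the partitioning
      .collect()                   // stage 2: action that pulls all rows to the driver

    result.foreach(println)
    spark.stop()
  }
}
```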
Is it possible to use different configurations for different stages of one Spark job?
I thought about splitting this Spark job into two sub-jobs, but that would defeat the purpose of using Spark, which keeps all intermediate results in memory. It would also significantly extend our job time.
Any ideas please?
Upvotes: 1
Views: 93
Reputation: 2449
I agree with Shaido's point, but I want to add that Spark 2.x comes with something known as dynamic resource allocation:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
At a high level, Spark should relinquish executors when they are no longer used and acquire executors when they are needed.
This means the application can scale the number of executors dynamically instead of relying on a fixed spark.executor.instances value.
spark.executor.instances is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.
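As a minimal sketch, dynamic allocation could be enabled when building the session like this (the min/max executor counts are illustrative only, and in Spark 2.x the external shuffle service also has to be running on the worker nodes):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; adjust the executor bounds to your cluster.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // required by dynamic allocation in Spark 2.x
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  // leave spark.executor.instances unset, otherwise dynamic allocation is turned off
  .getOrCreate()
```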
Upvotes: 0
Reputation: 28422
No, it's not possible to change the Spark configuration at runtime. See the documentation for SparkConf:
Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.
However, I would guess you do not need to do a repartition before the collect, if there are no other operations in between. repartition will move the data around on the nodes, which is unnecessary if what you want to do is collect them onto the driver node.
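A minimal sketch of the difference, assuming a spark-shell session (so the implicits are already in scope) and placeholder data:

```scala
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// With the extra shuffle: repartition moves rows between executors that are
// about to be pulled to the driver anyway.
val withRepartition = df.groupBy("key").sum("value").repartition(10).collect()

// Same result for collect, without the extra shuffle.
val withoutRepartition = df.groupBy("key").sum("value").collect()
```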
Upvotes: 1