sojim2

Reputation: 1317

spark.sql.shuffle.partitions & spark.default.parallelism cannot be changed after the Spark session has started?

Just wanted to confirm that these values cannot be changed after the Spark session has started? Basically, once the Spark cluster has been initiated and tasks are running, are these values immutable?

The reason is that if we autoscale up and our vCPU count outnumbers our data partition count, there will be a performance hit. I realize we can simply keep more data partitions than vCPUs; however, we see a performance hit of ~20% when data partitions outnumber vCPUs.

Upvotes: 0

Views: 603

Answers (1)

Jonathan Kelly

Reputation: 1990

No, they are not immutable. You can change them at any time while the application is running, though changes won't take effect for already-submitted queries. BTW, there's also the Adaptive Query Execution (AQE) feature, which is enabled by default as of EMR 5.30. AQE actually encompasses multiple sub-features, but one main thing it does is let you set the initial SQL shuffle partition count to a higher value and have Spark automatically coalesce it down to a lower number of partitions after each shuffle stage, depending on the amount of data being shuffled.

Upvotes: 1
