What does spark.sql.shuffle.partitions exactly refer to?

Question

What exactly does spark.sql.shuffle.partitions refer to? Are we talking of the number of partitions that is the results of a wide transformation, or something that happens in the middle as in some sort of intermediary partitioning before the result partition of the wide transformation?

Because in my understanding, as per a wide transformation we have

Parents RDDs -> shuffle files -> Child RDDs

What does the spark.sql.shuffle.partitions parameter refer to here? The shuffles files or the CHILD RDDs or something else that I ignored?

user10407081 · Accepted Answer

This is already explained in the official docs:

spark.sql.shuffle.partitions 200 Configures the number of partitions to use when shuffling data for joins or aggregations.

In other words it is the number of partitions of the child Dataset.

What does spark.sql.shuffle.partitions exactly refer to?

Answers (1)

Related Questions