Tanner Clark

Reputation: 691

What are Shuffled Partitions?

What is spark.sql.shuffle.partitions in a more technical sense? I have seen answers like this one, which say it "configures the number of partitions that are used when shuffling data for joins or aggregations."

What does that actually mean? How does shuffling work from node to node differently when this number is higher or lower?

Thanks!

Upvotes: 4

Views: 1817

Answers (2)

Mark

Reputation: 81

spark.sql.shuffle.partitions is the parameter that determines how many blocks your shuffle will be split into.

Say you had 40 GB of data and spark.sql.shuffle.partitions set to 400; your data would then be shuffled in blocks of roughly 40 GB / 400 = 100 MB each (assuming your data is evenly distributed).

By changing spark.sql.shuffle.partitions you change both the size of the blocks being shuffled and the number of blocks in each shuffle stage.

As Daniel says, a rule of thumb is to never set spark.sql.shuffle.partitions lower than the number of cores available to the job.
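To illustrate, here is a minimal sketch of the arithmetic above (the app name and all values are my own, purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; "local[4]" stands in for a real cluster.
val spark = SparkSession.builder()
  .appName("shuffle-block-size")
  .master("local[4]")
  .getOrCreate()

// With 400 shuffle partitions, ~40 GB of shuffled data would move in
// roughly 40 GB / 400 = ~100 MB blocks (assuming even distribution).
spark.conf.set("spark.sql.shuffle.partitions", "400")

// Halving the count doubles the per-block size (~200 MB) and halves
// the number of shuffle tasks in each shuffle stage.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```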

Upvotes: 2

Daniel

Reputation: 1242

Partitions define where data resides in your cluster. A single partition can contain many rows, but all of them will be processed together in a single task on one node.

Looking at edge cases: if we repartition our data into a single partition, then even if you have 100 executors, it will be processed by only one of them. [diagram: a single partition processed by one executor]

On the other hand, if you have a single executor but multiple partitions, they will all (obviously) be processed on the same machine.
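You can see the first edge case locally with a small sketch (names and counts are illustrative; "local[8]" stands in for a machine with 8 cores):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-edge-cases")
  .master("local[8]")
  .getOrCreate()
import spark.implicits._

val df = (1 to 100000).toDF("id")

// A single partition means a single task, so only one core does the
// work no matter how many are available.
println(df.repartition(1).rdd.getNumPartitions)  // 1

// Many partitions on one machine are still all processed there, just
// as separate tasks scheduled on the same executor.
println(df.repartition(16).rdd.getNumPartitions) // 16
```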

Shuffles happen when one executor needs data from another; the basic example is a groupBy aggregation, since we need all related rows together to calculate the result. Irrespective of how many partitions we had before the groupBy, after it Spark will split the results into spark.sql.shuffle.partitions partitions.
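A minimal sketch of that behavior (the value 12 is illustrative; adaptive query execution, which can coalesce shuffle partitions in Spark 3.x, is disabled so the raw setting is visible):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("groupby-shuffle")
  .master("local[4]")
  .config("spark.sql.shuffle.partitions", "12")
  .config("spark.sql.adaptive.enabled", "false")
  .getOrCreate()
import spark.implicits._

// 4 partitions before the shuffle...
val df = spark.range(0, 1000000, 1, 4).toDF("id")
println(df.rdd.getNumPartitions)      // 4

// ...but groupBy forces a shuffle, and the result lands in 12
// partitions: spark.sql.shuffle.partitions, not the input count.
val grouped = df.groupBy($"id" % 10).count()
println(grouped.rdd.getNumPartitions) // 12
```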

Quoting from "Spark: The Definitive Guide" by Bill Chambers and Matei Zaharia:

A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster, potentially by multiple factors depending on the workload. If you are running code on your local machine, it would behoove you to set this value lower because your local machine is unlikely to be able to execute that number of tasks in parallel.

So, to sum up: if you set this number lower than your cluster's capacity to run tasks in parallel, you won't be able to use all of its resources. On the other hand, since each task runs on a single partition, having thousands of tiny partitions would (I expect) add some scheduling overhead.

Upvotes: 8
