Tanner Clark

Reputation: 691

What are Shuffled Partitions?

What is spark.sql.shuffle.partitions in a more technical sense? I have seen answers like this one, which say it "configures the number of partitions that are used when shuffling data for joins or aggregations."

What does that actually mean? How does shuffling work from node to node differently when this number is higher or lower?

Thanks!

Upvotes: 4

Views: 1817

Answers (2)

Mark

Reputation: 81

spark.sql.shuffle.partitions is the parameter that determines how many blocks your shuffle will be split into.

Say you had 40 GB of data and spark.sql.shuffle.partitions set to 400; your data would then be shuffled in blocks of roughly 40 GB / 400 = 100 MB each (assuming your data is evenly distributed).

By changing spark.sql.shuffle.partitions you change both the size of the blocks being shuffled and the number of blocks in each shuffle stage.

As Daniel says, a rule of thumb is to never set spark.sql.shuffle.partitions lower than the number of cores available to the job.
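To illustrate, here is a minimal sketch of the arithmetic above (the app name and all values are my own, purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; "local[4]" stands in for a real cluster.
val spark = SparkSession.builder()
  .appName("shuffle-block-size")
  .master("local[4]")
  .getOrCreate()

// With 400 shuffle partitions, ~40 GB of shuffled data would move in
// roughly 40 GB / 400 = ~100 MB blocks (assuming even distribution).
spark.conf.set("spark.sql.shuffle.partitions", "400")

// Halving the count doubles the per-block size (~200 MB) and halves
// the number of shuffle tasks in each shuffle stage.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```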

Upvotes: 2

Daniel

Reputation: 1242

Partitions define where data resides in your cluster. A single partition can contain many rows, but all of them will be processed together in a single task on one node.

Looking at edge cases: if we repartition our data into a single partition, then even if you have 100 executors, it will be processed by only one of them. [diagram: a single partition processed by one executor]

On the other hand, if you have a single executor but multiple partitions, they will all (obviously) be processed on the same machine.
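You can see the first edge case locally with a small sketch (names and counts are illustrative; "local[8]" stands in for a machine with 8 cores):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-edge-cases")
  .master("local[8]")
  .getOrCreate()
import spark.implicits._

val df = (1 to 100000).toDF("id")

// A single partition means a single task, so only one core does the
// work no matter how many are available.
println(df.repartition(1).rdd.getNumPartitions)  // 1

// Many partitions on one machine are still all processed there, just
// as separate tasks scheduled on the same executor.
println(df.repartition(16).rdd.getNumPartitions) // 16
```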

Shuffles happen when one executor needs data from another; the basic example is a groupBy aggregation, since we need all related rows together to calculate the result. Irrespective of how many partitions we had before the groupBy, after it Spark will split the results into spark.sql.shuffle.partitions partitions.
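A minimal sketch of that behavior (the value 12 is illustrative; adaptive query execution, which can coalesce shuffle partitions in Spark 3.x, is disabled so the raw setting is visible):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("groupby-shuffle")
  .master("local[4]")
  .config("spark.sql.shuffle.partitions", "12")
  .config("spark.sql.adaptive.enabled", "false")
  .getOrCreate()
import spark.implicits._

// 4 partitions before the shuffle...
val df = spark.range(0, 1000000, 1, 4).toDF("id")
println(df.rdd.getNumPartitions)      // 4

// ...but groupBy forces a shuffle, and the result lands in 12
// partitions: spark.sql.shuffle.partitions, not the input count.
val grouped = df.groupBy($"id" % 10).count()
println(grouped.rdd.getNumPartitions) // 12
```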

Quoting from "Spark: The Definitive Guide" by Bill Chambers and Matei Zaharia:

A good rule of thumb is that the number of partitions should be larger than the number of executors on your cluster, potentially by multiple factors depending on the workload. If you are running code on your local machine, it would behoove you to set this value lower because your local machine is unlikely to be able to execute that number of tasks in parallel.

So, to sum up: if you set this number lower than your cluster's capacity to run tasks in parallel, you won't be able to use all of its resources. On the other hand, since each task runs on a single partition, having thousands of tiny partitions would (I expect) add some scheduling overhead.

Upvotes: 8
