Reputation: 26458
I'm a little confused about whether Sort Merge Join involves a shuffle phase before the sort phase. Some articles say yes, but then why isn't it called Shuffle Sort Merge Join, which would be more consistent with Shuffle Hash Join?
Upvotes: 5
Views: 3570
Reputation: 5068
TLDR: Yes, Spark's Sort Merge Join involves a shuffle phase. And we can speculate that it is not called Shuffle Sort Merge Join because there is no Broadcast Sort Merge Join to distinguish it from.
Spark's sort merge join algorithm distributes data across executors using a shuffle. Let's see it with an example.
So imagine you want to join the following `datasetA`:

id | value |
---|---|
3 | a3 |
1 | a1 |
4 | a4 |
2 | a2 |
With the following `datasetB`:

id | value |
---|---|
2 | b2 |
4 | b4 |
3 | b3 |
1 | b1 |
To do so, you have a Spark application running on 2 executors and you use the sort merge join strategy. Let's detail each step.

**Step 1: Shuffle.** Imagine we use a modulo 2 partition function on the join key `id`. Data will be redistributed on the two executors as below.
Executor 1 gets the rows whose id is 1 modulo 2, thus ids 1 and 3:

id | valueA |
---|---|
3 | a3 |
1 | a1 |

id | valueB |
---|---|
3 | b3 |
1 | b1 |
Executor 2 gets the rows whose id is 0 modulo 2, thus ids 2 and 4:

id | valueA |
---|---|
4 | a4 |
2 | a2 |

id | valueB |
---|---|
2 | b2 |
4 | b4 |
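
The shuffle step can be sketched in plain Python. This is only a conceptual illustration of hash partitioning by `id % 2` (all names are made up), not Spark's actual implementation:

```python
# Conceptual sketch of the shuffle step: route each row to an executor
# (partition) based on the join key modulo the number of executors.
# Plain Python illustration, not Spark's actual code.

def partition_by_key(rows, num_partitions=2):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row["id"] % num_partitions].append(row)
    return partitions

dataset_a = [{"id": 3, "value": "a3"}, {"id": 1, "value": "a1"},
             {"id": 4, "value": "a4"}, {"id": 2, "value": "a2"}]
dataset_b = [{"id": 2, "value": "b2"}, {"id": 4, "value": "b4"},
             {"id": 3, "value": "b3"}, {"id": 1, "value": "b1"}]

parts_a = partition_by_key(dataset_a)
parts_b = partition_by_key(dataset_b)

# Odd ids (1 modulo 2) land together, even ids (0 modulo 2) land together
print([r["id"] for r in parts_a[1]], [r["id"] for r in parts_b[1]])  # [3, 1] [3, 1]
print([r["id"] for r in parts_a[0]], [r["id"] for r in parts_b[0]])  # [4, 2] [2, 4]
```

Because both datasets are partitioned by the same function on the same key, matching ids are guaranteed to end up on the same executor.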
**Step 2: Sort.** Each executor then sorts its local partition of each dataset by the join key.

Executor 1:

id | valueA |
---|---|
1 | a1 |
3 | a3 |

id | valueB |
---|---|
1 | b1 |
3 | b3 |
Executor 2:

id | valueA |
---|---|
2 | a2 |
4 | a4 |

id | valueB |
---|---|
2 | b2 |
4 | b4 |
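
The sort step is a plain local sort by the join key on each executor. A minimal Python sketch (illustrative names only, using executor 1's partitions from the tables above):

```python
# Conceptual sketch of the sort step: each executor independently sorts
# its local partition of each dataset by the join key.

executor1_a = [{"id": 3, "value": "a3"}, {"id": 1, "value": "a1"}]
executor1_b = [{"id": 3, "value": "b3"}, {"id": 1, "value": "b1"}]

sorted_a = sorted(executor1_a, key=lambda row: row["id"])
sorted_b = sorted(executor1_b, key=lambda row: row["id"])

print([r["id"] for r in sorted_a])  # [1, 3]
print([r["id"] for r in sorted_b])  # [1, 3]
```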
**Step 3: Merge.** Each executor then merges its two sorted partitions, iterating over both in key order and joining rows with matching ids.

Executor 1:

id | valueA | valueB |
---|---|---|
1 | a1 | b1 |
3 | a3 | b3 |
Executor 2:

id | valueA | valueB |
---|---|---|
2 | a2 | b2 |
4 | a4 | b4 |
Finally, gathering the results from both executors gives the joined dataset:

id | valueA | valueB |
---|---|---|
1 | a1 | b1 |
3 | a3 | b3 |
2 | a2 | b2 |
4 | a4 | b4 |
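
The merge step itself is the classic two-pointer merge over both sorted partitions. A simplified Python sketch (inner join with unique keys; real Spark also handles duplicate keys, null keys and outer joins):

```python
# Conceptual sketch of the merge step: walk through two partitions
# already sorted by the join key, advancing the side with the smaller
# key and emitting a joined row whenever the keys match.

def merge_join(left, right):
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lid, rid = left[i]["id"], right[j]["id"]
        if lid == rid:
            result.append({"id": lid,
                           "valueA": left[i]["value"],
                           "valueB": right[j]["value"]})
            i += 1
            j += 1
        elif lid < rid:
            i += 1
        else:
            j += 1
    return result

# Executor 1's sorted partitions from the tables above
sorted_a = [{"id": 1, "value": "a1"}, {"id": 3, "value": "a3"}]
sorted_b = [{"id": 1, "value": "b1"}, {"id": 3, "value": "b3"}]

print(merge_join(sorted_a, sorted_b))
# [{'id': 1, 'valueA': 'a1', 'valueB': 'b1'}, {'id': 3, 'valueA': 'a3', 'valueB': 'b3'}]
```

Because each side is sorted, the merge makes a single pass over both partitions, which is what makes sort merge join efficient on large, similarly sized datasets.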
So, when you use the sort merge join strategy, you first shuffle the data.
If you look into Spark's code, you can see that the class `SortMergeJoinExec`, which executes a sort merge join, extends the trait `ShuffledJoin`. So the code itself tells you that there is a shuffle when executing a Sort Merge Join.
In a classic relational database management system, as everything is done on the same executor/server/machine, you only have to choose the join algorithm. The main join algorithms are:

- Hash Join
- Sort Merge Join
- Nested Loop Join
As Spark is a distributed computing framework that runs on several executors, you first have to split your big datasets into smaller parts that can be independently distributed, so that each executor can apply one of those join algorithms on its own part. To do this, you have two distribution strategies:

- Broadcast: send a complete copy of the smaller dataset to every executor
- Shuffle: repartition both datasets by the join key, so that rows with the same key land on the same executor
So when you ask Spark to join two datasets, Spark needs to choose two things: how it distributes data across executors (broadcast or shuffle) and how it performs the actual join (sort merge join, hash join or nested loop join). The combination of those two choices gives Spark's join strategies:

- Broadcast Hash Join
- Shuffled Hash Join
- Sort Merge Join
- Broadcast Nested Loop Join
We can see that Hash Join is the only join algorithm combined with both distribution strategies, Broadcast and Shuffle. So we can guess that the `Shuffled` prefix was added to avoid confusion between Hash Join with Broadcast and Hash Join with Shuffle. And as there is no Broadcast Sort Merge Join, there is no need for a `Shuffled` prefix to distinguish anything, which is why the Sort Merge Join strategy is not called Shuffled Sort Merge Join.
Upvotes: 12