Reputation: 26458
I'm a little confused about whether Sort Merge Join involves a shuffle phase before the sort phase. Some articles say yes, but then why isn't it called Shuffle Sort Merge Join, which would be more consistent with Shuffle Hash Join?
Upvotes: 5
Views: 3570
Reputation: 5068
TLDR: Yes, Spark's Sort Merge Join involves a shuffle phase. And we can speculate that it is not called Shuffle Sort Merge Join because there is no Broadcast Sort Merge Join to distinguish it from.
Spark's sort merge join algorithm distributes data across executors using a shuffle. Let's see it with an example.
So imagine you want to join the following `datasetA`:

id | value |
---|---|
3 | a3 |
1 | a1 |
4 | a4 |
2 | a2 |
With the following `datasetB`:

id | value |
---|---|
2 | b2 |
4 | b4 |
3 | b3 |
1 | b1 |
To do so, you have a Spark application running on 2 executors and you use the sort merge join strategy. Let's detail each step.

**Step 1: Shuffle.** Imagine we use a modulo 2 partition function on the join key `id`. Data will be redistributed on the two executors as below.
Executor 1 gets the rows whose id is 1 modulo 2, thus ids 1 and 3:

id | valueA |
---|---|
3 | a3 |
1 | a1 |

id | valueB |
---|---|
3 | b3 |
1 | b1 |
Executor 2 gets the rows whose id is 0 modulo 2, thus ids 2 and 4:

id | valueA |
---|---|
4 | a4 |
2 | a2 |

id | valueB |
---|---|
2 | b2 |
4 | b4 |
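
The shuffle step can be sketched in plain Python. This is only a conceptual illustration of hash partitioning by `id % 2` (all names are made up), not Spark's actual implementation:

```python
# Conceptual sketch of the shuffle step: route each row to an executor
# (partition) based on the join key modulo the number of executors.
# Plain Python illustration, not Spark's actual code.

def partition_by_key(rows, num_partitions=2):
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row["id"] % num_partitions].append(row)
    return partitions

dataset_a = [{"id": 3, "value": "a3"}, {"id": 1, "value": "a1"},
             {"id": 4, "value": "a4"}, {"id": 2, "value": "a2"}]
dataset_b = [{"id": 2, "value": "b2"}, {"id": 4, "value": "b4"},
             {"id": 3, "value": "b3"}, {"id": 1, "value": "b1"}]

parts_a = partition_by_key(dataset_a)
parts_b = partition_by_key(dataset_b)

# Odd ids (1 modulo 2) land together, even ids (0 modulo 2) land together
print([r["id"] for r in parts_a[1]], [r["id"] for r in parts_b[1]])  # [3, 1] [3, 1]
print([r["id"] for r in parts_a[0]], [r["id"] for r in parts_b[0]])  # [4, 2] [2, 4]
```

Because both datasets are partitioned by the same function on the same key, matching ids are guaranteed to end up on the same executor.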
**Step 2: Sort.** Each executor then sorts its local partition of each dataset by the join key.

Executor 1:

id | valueA |
---|---|
1 | a1 |
3 | a3 |

id | valueB |
---|---|
1 | b1 |
3 | b3 |
Executor 2:

id | valueA |
---|---|
2 | a2 |
4 | a4 |

id | valueB |
---|---|
2 | b2 |
4 | b4 |
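
The sort step is a plain local sort by the join key on each executor. A minimal Python sketch (illustrative names only, using executor 1's partitions from the tables above):

```python
# Conceptual sketch of the sort step: each executor independently sorts
# its local partition of each dataset by the join key.

executor1_a = [{"id": 3, "value": "a3"}, {"id": 1, "value": "a1"}]
executor1_b = [{"id": 3, "value": "b3"}, {"id": 1, "value": "b1"}]

sorted_a = sorted(executor1_a, key=lambda row: row["id"])
sorted_b = sorted(executor1_b, key=lambda row: row["id"])

print([r["id"] for r in sorted_a])  # [1, 3]
print([r["id"] for r in sorted_b])  # [1, 3]
```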
**Step 3: Merge.** Each executor then merges its two sorted partitions, iterating over both in key order and joining rows with matching ids.

Executor 1:

id | valueA | valueB |
---|---|---|
1 | a1 | b1 |
3 | a3 | b3 |
Executor 2:

id | valueA | valueB |
---|---|---|
2 | a2 | b2 |
4 | a4 | b4 |
Finally, gathering the results from both executors gives the joined dataset:

id | valueA | valueB |
---|---|---|
1 | a1 | b1 |
3 | a3 | b3 |
2 | a2 | b2 |
4 | a4 | b4 |
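
The merge step itself is the classic two-pointer merge over both sorted partitions. A simplified Python sketch (inner join with unique keys; real Spark also handles duplicate keys, null keys and outer joins):

```python
# Conceptual sketch of the merge step: walk through two partitions
# already sorted by the join key, advancing the side with the smaller
# key and emitting a joined row whenever the keys match.

def merge_join(left, right):
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lid, rid = left[i]["id"], right[j]["id"]
        if lid == rid:
            result.append({"id": lid,
                           "valueA": left[i]["value"],
                           "valueB": right[j]["value"]})
            i += 1
            j += 1
        elif lid < rid:
            i += 1
        else:
            j += 1
    return result

# Executor 1's sorted partitions from the tables above
sorted_a = [{"id": 1, "value": "a1"}, {"id": 3, "value": "a3"}]
sorted_b = [{"id": 1, "value": "b1"}, {"id": 3, "value": "b3"}]

print(merge_join(sorted_a, sorted_b))
# [{'id': 1, 'valueA': 'a1', 'valueB': 'b1'}, {'id': 3, 'valueA': 'a3', 'valueB': 'b3'}]
```

Because each side is sorted, the merge makes a single pass over both partitions, which is what makes sort merge join efficient on large, similarly sized datasets.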
So, when you use the sort merge join strategy, you first shuffle the data.
If you look into Spark's code, you can see that the class `SortMergeJoinExec`, which executes a sort merge join, extends the trait `ShuffledJoin`. So the code itself tells you that there is a shuffle when executing a Sort Merge Join.
In a classic relational database management system, as everything is done on the same executor/server/machine, you only have to choose the join algorithm. The main join algorithms are:

- Hash Join
- Sort Merge Join
- Nested Loop Join
As Spark is a distributed computing framework that runs on several executors, you first have to split your big datasets into smaller parts that can be independently distributed, so that each executor can apply one of those join algorithms on its own part. To do this, you have two distribution strategies:

- Broadcast: send a complete copy of the smaller dataset to every executor
- Shuffle: repartition both datasets by the join key, so that rows with the same key land on the same executor
So when you ask Spark to join two datasets, Spark needs to choose two things: how it distributes data across executors (broadcast or shuffle) and how it performs the actual join (sort merge join, hash join or nested loop join). The combination of those two choices gives Spark's join strategies:

- Broadcast Hash Join
- Shuffled Hash Join
- Sort Merge Join
- Broadcast Nested Loop Join
We can see that Hash Join is the only join algorithm combined with both distribution strategies, Broadcast and Shuffle. So we can guess that the `Shuffled` prefix was added to avoid confusion between Hash Join with Broadcast and Hash Join with Shuffle. And as there is no Broadcast Sort Merge Join, there is no need for a `Shuffled` prefix to distinguish anything, which is why the Sort Merge Join strategy is not called Shuffled Sort Merge Join.
Upvotes: 12