Reputation: 111
We know that in Spark have three types of joins -- Broadcast Join, Shuffle Join and Sort-Merge Join:
What happens in a case where there is a join of two big tables and the join key can't be sorted? Which join type Spark will choose?
Upvotes: 1
Views: 1312
Reputation: 9427
Spark 3.0 and above supports these types of joins:
Their selection is best outlined in the source code for SparkStrategies.scala
:
/**
* Select the proper physical plan for join based on join strategy hints, the availability of
* equi-join keys and the sizes of joining relations. Below are the existing join strategies,
* their characteristics and their limitations.
*
* - Broadcast hash join (BHJ):
* Only supported for equi-joins, while the join keys do not need to be sortable.
* Supported for all join types except full outer joins.
* BHJ usually performs faster than the other join algorithms when the broadcast side is
* small. However, broadcasting tables is a network-intensive operation and it could cause
* OOM or perform badly in some cases, especially when the build/broadcast side is big.
*
* - Shuffle hash join:
* Only supported for equi-joins, while the join keys do not need to be sortable.
* Supported for all join types except full outer joins.
*
* - Shuffle sort merge join (SMJ):
* Only supported for equi-joins and the join keys have to be sortable.
* Supported for all join types.
*
* - Broadcast nested loop join (BNLJ):
* Supports both equi-joins and non-equi-joins.
* Supports all the join types, but the implementation is optimized for:
* 1) broadcasting the left side in a right outer join;
* 2) broadcasting the right side in a left outer, left semi, left anti or existence join;
* 3) broadcasting either side in an inner-like join.
* For other cases, we need to scan the data multiple times, which can be rather slow.
*
* - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
* Supports both equi-joins and non-equi-joins.
* Supports only inner like joins.
*/
object JoinSelection extends Strategy with PredicateHelper { ...
As stated, the outcome of applying the selection depends not only on the size of the tables and sortability of the keys, but also on a join type (INNER
, LEFT/RIGHT
, FULL
) and join key conditions (equi- vs non-equi/theta). Overall, seems like in your situation you'll be looking at either Shuffle Hash or Broadcast Nested Loop.
Upvotes: 6