Which join will Spark choose when all the selection criteria are not met?

Question

We know that in Spark have three types of joins -- Broadcast Join， Shuffle Join and Sort-Merge Join：

when small table join big table，use Broadcast Join；
when small table big than BroadcastJoinThreshold，use Shuffle Join；
when big table join，and join key can sort，use Sort-Merge Join；

What happens in a case where there is a join of two big tables and the join key can't be sorted? Which join type Spark will choose?

mazaneicha · Accepted Answer

Spark 3.0 and above supports these types of joins:

Broadcast hash join (BHJ)
Shuffle hash join
Shuffle sort merge join (SMJ)
Broadcast nested loop join (BNLJ)
Cartesian product join

Their selection is best outlined in the source code for SparkStrategies.scala:

  /**
   * Select the proper physical plan for join based on join strategy hints, the availability of
   * equi-join keys and the sizes of joining relations. Below are the existing join strategies,
   * their characteristics and their limitations.
   *
   * - Broadcast hash join (BHJ):
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *     BHJ usually performs faster than the other join algorithms when the broadcast side is
   *     small. However, broadcasting tables is a network-intensive operation and it could cause
   *     OOM or perform badly in some cases, especially when the build/broadcast side is big.
   *
   * - Shuffle hash join:
   *     Only supported for equi-joins, while the join keys do not need to be sortable.
   *     Supported for all join types except full outer joins.
   *
   * - Shuffle sort merge join (SMJ):
   *     Only supported for equi-joins and the join keys have to be sortable.
   *     Supported for all join types.
   *
   * - Broadcast nested loop join (BNLJ):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports all the join types, but the implementation is optimized for:
   *       1) broadcasting the left side in a right outer join;
   *       2) broadcasting the right side in a left outer, left semi, left anti or existence join;
   *       3) broadcasting either side in an inner-like join.
   *     For other cases, we need to scan the data multiple times, which can be rather slow.
   *
   * - Shuffle-and-replicate nested loop join (a.k.a. cartesian product join):
   *     Supports both equi-joins and non-equi-joins.
   *     Supports only inner like joins.
   */
object JoinSelection extends Strategy with PredicateHelper { ...

As stated, the outcome of applying the selection depends not only on the size of the tables and sortability of the keys, but also on a join type (INNER, LEFT/RIGHT, FULL) and join key conditions (equi- vs non-equi/theta). Overall, seems like in your situation you'll be looking at either Shuffle Hash or Broadcast Nested Loop.

Which join will Spark choose when all the selection criteria are not met?

Answers (1)

Related Questions