s.r
s.r

Reputation: 2577

Avoid Broadcast nested loop join

When I join two dataframes using left join like this:

df1.join(broadcast(df2), $"id" === $"id1" || $"id2" === $"id3", "left")

Without $"id2" === $"id3", it executes very quickly but when both conditions are present, it becomes BroadcastNestedLoopJoin and becomes very very slow.

Any ideas how I can improve this?

Upvotes: 0

Views: 3353

Answers (2)

gatear
gatear

Reputation: 946

Take for example Spark 3.2.

Spark will delegate to one of the 5 available join strategies. In your case, only if you have an equi-join it can use Broadcast Hash Join(BHJ).

A join condition using disjunctive predicates like $"id" === $"id1" || $"id2" === $"id3" is a non-equi join.

A join condition using conjunctive predicates predicate1 AND predicate2 AND .. is an equi join.

In order to get that BHJ you can delegate to multiple equi-joins.

df1
.join(broadcast(df2), $"id" === $"id1", "left")
.join(broadcast(df2),  $"id2" === $"id3", "left")

Then validate in the query plan that both joins refer to the same broadcasted data and use a broadcast hash join.

Yes, you will see multiple joins but with no shuffle. Try implementing your joins in terms of equi-joins otherwise you're stuck with BNLJ.

Upvotes: 0

Addy
Addy

Reputation: 427

BroadcastNestedLoopJoin means nested for-loops to join your data-frames. It will always give a degraded performance.

Can you try below solution:

val resultPart1 = df1.join(broadcast(df2), $"id" === $"id1", "left")
val resultPart2 = df1.join(broadcast(df2),  $"id2" === $"id3", "left")

val resultDF = resultPart1.unionByName(resultPart2)

Union causes zero shufflings of data across executors.Hence produces faster results

Upvotes: 6

Related Questions