Reputation: 2577
When I join two dataframes using left join like this:
df1.join(broadcast(df2), $"id" === $"id1" || $"id2" === $"id3", "left")
Without $"id2" === $"id3"
, it executes very quickly but when both conditions are present, it becomes BroadcastNestedLoopJoin
and becomes very very slow.
Any ideas how I can improve this?
Upvotes: 0
Views: 3353
Reputation: 946
Take for example Spark 3.2.
Spark will delegate to one of the 5 available join strategies. In your case, only if you have an equi-join it can use Broadcast Hash Join(BHJ).
A join condition using disjunctive predicates like $"id" === $"id1" || $"id2" === $"id3"
is a non-equi join.
A join condition using conjunctive predicates predicate1 AND predicate2 AND ..
is an equi join.
In order to get that BHJ you can delegate to multiple equi-joins.
df1
.join(broadcast(df2), $"id" === $"id1", "left")
.join(broadcast(df2), $"id2" === $"id3", "left")
Then validate in the query plan that both joins refer to the same broadcasted data and use a broadcast hash join.
Yes, you will see multiple joins but with no shuffle. Try implementing your joins in terms of equi-joins otherwise you're stuck with BNLJ.
Upvotes: 0
Reputation: 427
BroadcastNestedLoopJoin means nested for-loops to join your data-frames. It will always give a degraded performance.
Can you try below solution:
val resultPart1 = df1.join(broadcast(df2), $"id" === $"id1", "left")
val resultPart2 = df1.join(broadcast(df2), $"id2" === $"id3", "left")
val resultDF = resultPart1.unionByName(resultPart2)
Union causes zero shufflings of data across executors.Hence produces faster results
Upvotes: 6