Reputation: 369
I'm seeing that if my join condition involves an OR clause, it takes much longer than an AND clause. I'm doing a full outer join.
My unit tests don't show a major difference, but when run against a large dataset it runs very slowly.
df1.join(df2, expr("id1 = id2 AND amount1 = amount2"), "fullouter").cache()
The following runs for a long time, i.e. roughly 10 times slower:
df1.join(df2, expr("id1 = id2 OR amount1 = amount2"), "fullouter").cache()
Any thoughts?
Upvotes: 4
Views: 1764
Reputation: 330303
TL;DR The performance difference between these two is expected.
Logical conjunction (AND) can be expressed as a shuffle-based operation (either a sort merge join or a shuffle hash join). This means that each record has to be transferred to only one child partition, based on the hash of all expressions included in the join condition.
While in the worst-case scenario (all join keys constant and matching between both datasets) it can take N*M comparisons with all records shuffled to a single partition, typical scenarios with real-life data should be much more efficient, requiring roughly N + M comparisons.
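You can see this directly by calling explain() on a toy example. This is a minimal sketch, assuming two hypothetical DataFrames shaped like df1/df2 in the question (the column names match the question, the values are made up); with the equi-join condition you should see a shuffle-based operator such as SortMergeJoin in the physical plan.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
val spark = SparkSession.builder().master("local[*]").appName("join-plans").getOrCreate()
import spark.implicits._
// Hypothetical tiny DataFrames mirroring the column names from the question.
val df1 = Seq((1, 100.0), (2, 200.0), (3, 300.0)).toDF("id1", "amount1")
val df2 = Seq((1, 100.0), (3, 999.0), (4, 400.0)).toDF("id2", "amount2")
// Equi-join on both columns: the planner can shuffle on hash(id, amount),
// so the physical plan shows a shuffle-based join (e.g. SortMergeJoin).
df1.join(df2, expr("id1 = id2 AND amount1 = amount2"), "fullouter").explain()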
Logical disjunction (OR) cannot be expressed as a simple shuffle-based operation, because records hashing to different buckets can still match. Therefore Spark falls back to a Cartesian product followed by a selection, which always transfers each record multiple times and makes N*M comparisons.
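Continuing the same hypothetical df1/df2 sketch from above, the OR condition is a non-equi join, so the physical plan typically falls back to a nested-loop/Cartesian-style operator (e.g. BroadcastNestedLoopJoin or CartesianProduct, depending on join type and table sizes), which is where the ~10x slowdown comes from.
// Non-equi condition: no single hash key covers "id1 = id2 OR amount1 = amount2",
// so expect a nested-loop / Cartesian-style operator instead of a shuffle join.
df1.join(df2, expr("id1 = id2 OR amount1 = amount2"), "fullouter").explain()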
Upvotes: 5