Adisak Rungruang

Reputation: 21

Join PySpark SQL DataFrames that are already partitioned in a subset of the keys

I want to join two Spark DataFrames that are already partitioned on a subset of the keys I use to join. But when I do, Exchange operations still occur anyway. How can I join them without an Exchange or a Broadcast?

For example, I have DataFrames df1 and df2. They both have the same columns (col1, col2, col3), and both have already been partitioned by col1. I want to join them on col1 and col2, but when I do, they get repartitioned again on col1 and col2.

Upvotes: 2

Views: 1755

Answers (1)

Mageswaran

Reputation: 450

AFAIK, the DataFrames need to be partitioned by the same column(s) on both sides to get away with a single shuffle.

Eg:

right_df = right_df.repartition(400)
left_df = left_df.repartition(400)  # round-robin repartition: leads to one more shuffle when the join runs later
df = left_df.join(right_df, col("id") == col("user_id"), "outer")


left_df = left_df.withColumnRenamed("id", "repartition_id").repartition(400, col("repartition_id"))
right_df = right_df.withColumnRenamed("user_id", "repartition_id").repartition(400, col("repartition_id"))
df = left_df.join(right_df, "repartition_id", "outer")


Upvotes: 1
