Adisak Rungruang

Reputation: 21

Join PySpark SQL DataFrames that are already partitioned in a subset of the keys

I want to join two Spark DataFrames that are already partitioned on a subset of the keys I use to join. But when I do, Exchange operations still occur anyway. How can I join them without an Exchange or a Broadcast?

For example, I have DataFrames df1 and df2. They both have the same columns (col1, col2, col3), and both have already been partitioned by col1. I want to join them on col1 and col2, but when I do, they get repartitioned again on col1 and col2.

Upvotes: 2

Views: 1755

Answers (1)

Mageswaran

Reputation: 450

AFAIK, the DataFrames need to be partitioned by the same column(s) on both sides to get away with a single shuffle.

Eg:

right_df = right_df.repartition(400)
left_df = left_df.repartition(400)  # round-robin repartition: leads to one more shuffle when the join runs later
df = left_df.join(right_df, col("id") == col("user_id"), "outer")


left_df = left_df.withColumnRenamed("id", "repartition_id").repartition(400, col("repartition_id"))
right_df = right_df.withColumnRenamed("user_id", "repartition_id").repartition(400, col("repartition_id"))
df = left_df.join(right_df, "repartition_id", "outer")


Upvotes: 1
