Reputation: 21
I want to join two Spark DataFrames that are already partitioned on a subset of the keys I use to join. But when I do the join, Exchange operations still occur. How can I join them without an Exchange or a Broadcast?
For example, I have DataFrames df1 and df2. They both have the same columns: col1, col2, col3. Both have already been partitioned on col1. I want to join them on col1 and col2, but when I do, the data gets repartitioned again on col1 and col2.
Upvotes: 2
Views: 1755
Reputation: 450
AFAIK, both DataFrames need to be partitioned on the same column(s), with the same number of partitions, for the join to get by with a single shuffle.
Eg:
from pyspark.sql.functions import col

# Repartitioning by partition count alone does not align the rows on the
# join key, so the join below still adds one more shuffle on each side:
right_df = right_df.repartition(400)
left_df = left_df.repartition(400)
df = left_df.join(right_df, col("id") == col("user_id"), "outer")

# Repartitioning both sides on the same column name and partition count
# lets the join reuse that partitioning:
left_df = left_df.withColumnRenamed("id", "repartition_id").repartition(400, col("repartition_id"))
right_df = right_df.withColumnRenamed("user_id", "repartition_id").repartition(400, col("repartition_id"))
df = left_df.join(right_df, "repartition_id", "outer")
Upvotes: 1