Rajesh Kumar Dash
Rajesh Kumar Dash

Reputation: 2277

How to remove sort phase in spark dataframe join?

I had created a bucketed table using below command in Spark:

df.write.bucketBy(200, "UserID").sortBy("UserID").saveAsTable("topn_bucket_test")

Size of Table : 50 GB

Then I joined another table (say t2 , size :70 GB)(Bucketed as before ) with above table on UserId column . I found that in the execution plan the table topn_bucket_test was being sorted (but not shuffled) before the join and I expected it to be neither shuffled nor sorted before join as it was bucketed. What can be the reason ? and how to remove sort phase for topn_bucket_test?

Upvotes: 1

Views: 408

Answers (1)

Michael Heil
Michael Heil

Reputation: 18475

As far as I am concerned it is not possible to avoid the sort phase. When using the same bucketBy call it is unlikely that the physical bucketing will be identical in both tables. Imagine the first table having UserID ranging from 1 to 1000 and the second from 1 to 2000. Different UserIDs might end up in the 200 buckets and within those bucket there might be multiple different (and unsorted!) UserIDs.

Upvotes: 1

Related Questions