Reputation: 992
I have DF1 with ~50k records. DF2 has >5 billion records read from Parquet files on S3. I need to do a left outer join on an md5 hash column present in both DataFrames, but as expected it's slow and expensive.
I tried a broadcast join, but DF1 is quite big as well.
I was wondering what the best way to handle this would be. Should I filter DF2 down to those 50k md5 values first and then do the join with DF1?
Thanks.
Upvotes: 1
Views: 1345
Reputation: 124
You could try out the following options:
Remove duplicate rows and filter out any columns that are not relevant for downstream processing before calling the outer join
Persist DF1 to disk before calling the join operation
Figure out whether DF2 has data skew on the join column; if it does, split the skewed key values out and process their join separately
You could also try tweaking Spark's execution parameters (shuffle partitions, broadcast threshold, adaptive query execution) to tune the job
Upvotes: 1