vincwng

Reputation: 658

Join relatively small table with large table in Spark 2.1

I am currently working on updating a table based on its existence on another table:

Ex:

Dataset A (relatively small, 300k of rows): DepartmentId, EmployeeId, Salary, Error

Dataset B (relatively huge, millions of rows): DepartmentId, EmployeeId, Salary

The logic is:

1. If A's (DepartmentId, EmployeeId) pair exists in B, then update A's Salary with B's Salary.
2. Otherwise, write a message to A's Error field.

The solution I have now is a left outer join of A with B. Are there any better practices for this type of problem?
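Roughly, what I have now looks like this (dfA and dfB are placeholder names for the two datasets):

```scala
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

// dfA (small): DepartmentId, EmployeeId, Salary, Error
// dfB (large): DepartmentId, EmployeeId, Salary
val updated = dfA.as("a")
  .join(dfB.as("b"), Seq("DepartmentId", "EmployeeId"), "left_outer")
  .select(
    col("DepartmentId"),
    col("EmployeeId"),
    // take B's Salary when the pair matched, otherwise keep A's Salary
    coalesce(col("b.Salary"), col("a.Salary")).as("Salary"),
    // write the error message only when the pair was not found in B
    when(col("b.Salary").isNull, lit("(DepartmentId, EmployeeId) not found in B"))
      .otherwise(col("a.Error")).as("Error"))
```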

Thank you in advance!

Upvotes: 3

Views: 12817

Answers (1)

koiralo

Reputation: 23109

For better performance, you can use a broadcast hash join, as mentioned here by @Ram Ghadiyaram:

The broadcasted DataFrame is distributed to all partitions, which improves join performance.

DataFrame join optimization - Broadcast Hash Join
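Since only A (~300k rows) is small enough to broadcast here, one way to apply the hint to this left outer join is to first shrink B with a broadcast inner join on A's keys. A rough sketch, reusing the hypothetical dfA/dfB names from the question:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcast A's join keys (the small side). Spark ships the broadcast copy to
// every executor, so B is filtered to the matching rows without shuffling B.
val bMatching = dfB.join(
  broadcast(dfA.select("DepartmentId", "EmployeeId")),
  Seq("DepartmentId", "EmployeeId"))

// bMatching is now at most roughly A's size, so the left outer join from the
// question (A left-outer-joined with bMatching) becomes cheap.
```

Spark also broadcasts a side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), but the explicit broadcast() hint makes the intent clear.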

Hope this helps!

Upvotes: 3
