Reputation: 658
I am currently working on updating a table based on its existence on another table:
Ex:
Dataset A (relatively small, 300k of rows): DepartmentId, EmployeeId, Salary, Error
Dataset B (relatively huge, millions of rows): DepartmentId, EmployeeId, Salary
The logic is: 1. If A's (DepartmentId, EmployeeId) pair exists in B, then update A's salary with B's salary 2. Otherwise, write a message to A's error field
The solution I have now is doing a left outer join on A with B. Is there any other better practices for this type of problem?
Thank you in advance!
Upvotes: 3
Views: 12817
Reputation: 23109
For better performance, you can use broadcast hash join as mention here by @Ram Ghadiyaram
The broadcasted dataframe will be distributed in all the partition which increases the performance in joining.
DataFrame join optimization - Broadcast Hash Join
Hope this helps!
Upvotes: 3