Reputation: 997
I have to compare two dataframes to find out the columns differences based on one or more key fields using pyspark in a most performance efficient approach since I have to deal with huge dataframes
I have already built a solution for comparing two dataframes using hash match without key field matching like data_compare.df_subtract(self.df_db1_hash,self.df_db2_hash)
but scenario is different if I want to use key field match
Note: I have provided sample expected dataframe. Actual requirement is any differences from DataFrame 2 in any columns should be retrieved in output/expected dataframe.
DataFrame 1:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
DataFrame 2:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sandiego| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Expected dataframe after comparing dataframe 1 and 2
+------+---------+--------+----------+
|emp_id| emp_city|emp_name| emp_phone|
+------+---------+--------+----------+
| 4| sandiego| romino|9848022331|
+------+---------+--------+----------+
Upvotes: 0
Views: 2625
Reputation: 41957
subract
function is what you are looking for, which will check all the columns value for each row and gives you a dataframe which are different from the other dataframe.
df2.subtract(df1).select("emp_id","emp_city","emp_name","emp_phone")
As the api document says
Return a new :class:
DataFrame
containing rows in this frame but not in another frame.
This is equivalent toEXCEPT
in SQL.
Upvotes: 2