Jack
Jack

Reputation: 997

pyspark dataframe comparison to find columns difference based on key fields

I have to compare two dataframes to find out the columns differences based on one or more key fields using pyspark in a most performance efficient approach since I have to deal with huge dataframes

I have already built a solution for comparing two dataframes using hash match without key field matching like data_compare.df_subtract(self.df_db1_hash,self.df_db2_hash) but scenario is different if I want to use key field match

Note: I have provided sample expected dataframe. Actual requirement is any differences from DataFrame 2 in any columns should be retrieved in output/expected dataframe.

DataFrame 1:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

DataFrame 2:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4| sandiego|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

Expected dataframe after comparing dataframe 1 and 2


+------+---------+--------+----------+
|emp_id| emp_city|emp_name| emp_phone|
+------+---------+--------+----------+
|     4| sandiego|  romino|9848022331|
+------+---------+--------+----------+

Upvotes: 0

Views: 2625

Answers (1)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41957

subract function is what you are looking for, which will check all the columns value for each row and gives you a dataframe which are different from the other dataframe.

df2.subtract(df1).select("emp_id","emp_city","emp_name","emp_phone")

As the api document says

Return a new :class:DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.

Upvotes: 2

Related Questions