pyspark dataframe comparison to find columns difference based on key fields

Question

I have to compare two dataframes to find out the columns differences based on one or more key fields using pyspark in a most performance efficient approach since I have to deal with huge dataframes

I have already built a solution for comparing two dataframes using hash match without key field matching like data_compare.df_subtract(self.df_db1_hash,self.df_db2_hash) but scenario is different if I want to use key field match

Note: I have provided sample expected dataframe. Actual requirement is any differences from DataFrame 2 in any columns should be retrieved in output/expected dataframe.

DataFrame 1:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

DataFrame 2:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4| sandiego|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

Expected dataframe after comparing dataframe 1 and 2


+------+---------+--------+----------+
|emp_id| emp_city|emp_name| emp_phone|
+------+---------+--------+----------+
|     4| sandiego|  romino|9848022331|
+------+---------+--------+----------+

pyspark dataframe comparison to find columns difference based on key fields

Answers (1)

Related Questions