Reputation: 6815
I have two dataframes (over 1 million records each). Only ~10% of the rows are different. I know how to find the delta:
df1.subtract(df2)
But I would also like to know which records are new and which have changed. I know I can do this with a Hive context once I have the delta, but maybe there is a simpler way to do this with some PySpark functions?
Thanks in advance.
Upvotes: 2
Views: 1682
Reputation: 15283
Just perform joins with leftsemi and leftanti:
df = df1.subtract(df2)  # diff dataframe: rows in df1 that are not in df2
df.join(df2, how='leftsemi', on='id').show()  # will print the modified rows (id already exists in df2)
df.join(df2, how='leftanti', on='id').show()  # will print the new rows (id absent from df2)
Upvotes: 3