max04

Reputation: 6815

How to find out what is new and what has changed when comparing two DataFrames in PySpark?

I have two DataFrames (over 1 million records). Only ~10% of the rows are different. I know how to find the delta:

df1.subtract(df2)

But I would also like to know which records are new and which have changed. I know I can do this using HiveContext once I have the delta, but maybe there is a simpler way based on some built-in PySpark functions?

Thanks in advance.

Upvotes: 2

Views: 1682

Answers (1)

Steven

Reputation: 15283

Just join the delta back to df2 using leftsemi and leftanti joins:

df = df1.subtract(df2)  # the delta: rows of df1 that are not in df2
df.join(df2, how='leftsemi', on='id').show()  # id also in df2: modified rows
df.join(df2, how='leftanti', on='id').show()  # id not in df2: new rows
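For reference, here is a minimal end-to-end sketch of the same technique with toy data. The id and value column names are only illustrative assumptions; any shared key column works the same way:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('diff-demo').getOrCreate()

# old state vs. new state of the same table
df2 = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])
df1 = spark.createDataFrame([(1, 'a'), (2, 'b2'), (3, 'c')], ['id', 'value'])

diff = df1.subtract(df2)  # rows of df1 that do not appear in df2

# id exists in df2 -> the row was there before but its values changed
diff.join(df2, how='leftsemi', on='id').show()  # prints (2, 'b2')

# id absent from df2 -> a genuinely new row
diff.join(df2, how='leftanti', on='id').show()  # prints (3, 'c')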

Upvotes: 3
