Reputation: 6815
I have two dataframes (over 1 million records each). Only ~10% of the rows are different. I know how to find the delta:
df1.subtract(df2)
But I would also like to know which records are new and which have changed. I know I can do this with a Hive context once I have the delta, but maybe there is a simpler way to do this with some PySpark functions?
Thanks in advance.
Upvotes: 2
Views: 1682
Reputation: 15283
Just perform joins with leftsemi and leftanti:
df = df1.subtract(df2)  # diff dataframe: rows in df1 that are not in df2
df.join(df2, how='leftsemi', on='id').show()  # will print the modified rows (id already exists in df2)
df.join(df2, how='leftanti', on='id').show()  # will print the new rows (id absent from df2)
Upvotes: 3