Reputation: 1
So I've been looking at different ways to compare two PySpark dataframes where we have no key columns.
Let's say I have two dataframes, df1 & df2, with columns col1, col2, col3.
The idea is that I would get an output dataframe containing rows from df1 that do not match with any rows in df2 and vice versa. I would also like some kind of flag so I can distinguish between rows from df1 and rows from df2.
I have so far looked at a full outer join as a method, such as:
from pyspark.sql.functions import col, lit, when

columns = df1.columns
df1 = df1.withColumn("df1_flag", lit("X"))
df2 = df2.withColumn("df2_flag", lit("X"))
df3 = (df1.join(df2, columns, how="full")
       .withColumn("flag", when(col("df1_flag").isNotNull() & col("df2_flag").isNotNull(), "MATCHED")
                   .otherwise(when(col("df1_flag").isNotNull(), "df1").otherwise("df2")))
       .drop("df1_flag", "df2_flag"))
df4 = df3.filter(df3.flag != "MATCHED")
The issue with the full outer join is that I may need to deal with some very large dataframes (1 million+ records), so I am concerned about efficiency. I have thought about using a left anti join and a right anti join and then combining the results, but there are efficiency worries with that approach as well.
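For reference, the anti-join-both-ways idea boils down to a multiset symmetric difference with a provenance flag. A minimal plain-Python sketch of the semantics (rows as tuples standing in for dataframe rows; the function name is illustrative, not a PySpark API):

```python
from collections import Counter

def unmatched_with_flag(rows1, rows2):
    """Multiset symmetric difference of two row lists, tagging each
    surviving row with the side it came from ("df1" or "df2")."""
    c1, c2 = Counter(rows1), Counter(rows2)
    only1 = c1 - c2  # copies of df1 rows with no matching df2 row left
    only2 = c2 - c1  # copies of df2 rows with no matching df1 row left
    return ([(row, "df1") for row in only1.elements()]
            + [(row, "df2") for row in only2.elements()])

df1_rows = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
df2_rows = [(2, "b"), (3, "c"), (4, "d")]
result = unmatched_with_flag(df1_rows, df2_rows)
# One copy of (2, "b") survives on the df1 side because df1 holds it twice
```

The same duplicate-aware behaviour is what distinguishes the anti-join/exceptAll family from a plain set difference.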
Is there any method of comparison I am overlooking here that could be more efficient for very large dataframes?
Upvotes: 0
Views: 729
Reputation: 5526
You can run a minus (EXCEPT ALL) query on your dataframes:
mismatched_df1 = df1.exceptAll(df2)
mismatched_df2 = df2.exceptAll(df1)
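Worth noting: exceptAll uses multiset (EXCEPT ALL) semantics, so duplicate rows are subtracted copy-for-copy, unlike subtract(), which works with set semantics. A plain-Python sketch of the distinction (row lists standing in for the dataframes):

```python
from collections import Counter

rows1 = [(1, "a"), (1, "a"), (2, "b")]
rows2 = [(1, "a")]

# exceptAll-style: remove one copy per matching row, keep the remainder
except_all_rows = sorted((Counter(rows1) - Counter(rows2)).elements())

# subtract-style: set difference, so both copies of (1, "a") disappear
subtract_rows = sorted(set(rows1) - set(rows2))
```

Here except_all_rows keeps one leftover copy of (1, "a"), while subtract_rows drops it entirely.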
Upvotes: 0