Jonny

Reputation: 1

Are there any alternatives to a full outer join for comparing PySpark dataframes with no key columns?

So I've been looking at different ways to compare two PySpark dataframes where we have no key columns.

Let's say I have two dataframes, df1 & df2, with columns col1, col2, col3.

The idea is that I would get an output dataframe containing rows from df1 that do not match with any rows in df2 and vice versa. I would also like some kind of flag so I can distinguish between rows from df1 and rows from df2.

I have so far looked at a full outer join as a method, such as:

from pyspark.sql.functions import lit, when, col

# Capture the original column list before adding the helper flags
columns = df1.columns
df1 = df1.withColumn("df1_flag", lit("X"))
df2 = df2.withColumn("df2_flag", lit("X"))
# After a full outer join on every column, a null flag on either side means no match
df3 = df1.join(df2, columns, how='full')\
    .withColumn("FLAG", when(col("df1_flag").isNotNull() & col("df2_flag").isNotNull(), "MATCHED")
    .otherwise(when(col("df1_flag").isNotNull(), "df1").otherwise("df2")))\
    .drop("df1_flag", "df2_flag")
df4 = df3.filter(df3.FLAG != "MATCHED")

The issue with the full outer join is that I may need to deal with some very large dataframes (1 million+ records), so I am concerned about efficiency. I have also thought about doing a left anti join in each direction (df1 against df2, then df2 against df1) and unioning the results, as sketched below, but I have the same efficiency worries there.
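For reference, the anti-join version would look something like this (an untested sketch, run on the original df1/df2 before any helper flag columns are added; 'left_anti' keeps only the rows of the left dataframe that have no match in the right one):

from pyspark.sql.functions import lit

# Rows in df1 with no exact match in df2, and vice versa
only_df1 = df1.join(df2, df1.columns, how='left_anti').withColumn("FLAG", lit("df1"))
only_df2 = df2.join(df1, df2.columns, how='left_anti').withColumn("FLAG", lit("df2"))
result = only_df1.union(only_df2)

One caveat with both join approaches: rows containing nulls will never match, because nulls do not compare equal in join conditions.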

Is there any method of comparison I am overlooking here that could be more efficient for very large dataframes?

Upvotes: 0

Views: 729

Answers (1)

Shubham Jain

Reputation: 5526

You can run a minus query on your dataframes with exceptAll, which returns the rows of one dataframe that are not present in the other (keeping duplicates):

mismatched_df1 = df1.exceptAll(df2)
mismatched_df2 = df2.exceptAll(df1)
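This avoids a join altogether. If you also want the flag from the question, you could tag each result and union them, along these lines:

from pyspark.sql.functions import lit

# Tag each side so the origin of every unmatched row survives the union
flagged = mismatched_df1.withColumn("FLAG", lit("df1")) \
    .union(mismatched_df2.withColumn("FLAG", lit("df2")))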

Upvotes: 0
