Codegator

Reputation: 637

Pyspark : Subtracting/Difference pyspark dataframes based on all columns

I have two pyspark dataframes like below -

df1

id     city      country       region    continent
1      chicago    USA          NA         NA
2      houston    USA          NA         NA
3      Sydney     Australia    AU         AU
4      London     UK           EU         EU

df2

id     city      country       region    continent
1      chicago    USA          NA         NA
2      houston    USA          NA         NA
3      Paris      France       EU         EU
5      London     UK           EU         EU

I want to find the rows that exist in df2 but not in df1, based on all column values. So df2 - df1 should produce df_result like below

df_result

id     city      country       region    continent
3      Paris      France       EU         EU
5      London     UK           EU         EU

How can I achieve this in pyspark? Thanks in advance.

Upvotes: 3

Views: 2395

Answers (2)

dsk

Reputation: 2003

Another easy solution is the exceptAll() function. The docs say:

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. This is equivalent to EXCEPT ALL in SQL. As standard in SQL, this function resolves columns by position (not by name).

Create the DataFrames:

df_a = spark.createDataFrame(
    [(1, "chicago", "USA", "NA", "NA"),
     (2, "houston", "USA", "NA", "NA"),
     (3, "Sydney", "Australia", "AU", "AU"),
     (4, "London", "UK", "EU", "EU")],
    ["id", "city", "country", "region", "continent"],
)
df_a.show(truncate=False)

df_b = spark.createDataFrame(
    [(1, "chicago", "USA", "NA", "NA"),
     (2, "houston", "USA", "NA", "NA"),
     (3, "Paris", "France", "EU", "EU"),
     (5, "London", "UK", "EU", "EU")],
    ["id", "city", "country", "region", "continent"],
)
df_b.show(truncate=False)

df_a

+---+-------+---------+------+---------+
|id |city   |country  |region|continent|
+---+-------+---------+------+---------+
|1  |chicago|USA      |NA    |NA       |
|2  |houston|USA      |NA    |NA       |
|3  |Sydney |Australia|AU    |AU       |
|4  |London |UK       |EU    |EU       |
+---+-------+---------+------+---------+

df_b

+---+-------+-------+------+---------+
|id |city   |country|region|continent|
+---+-------+-------+------+---------+
|1  |chicago|USA    |NA    |NA       |
|2  |houston|USA    |NA    |NA       |
|3  |Paris  |France |EU    |EU       |
|5  |London |UK     |EU    |EU       |
+---+-------+-------+------+---------+

Final output

df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+

Upvotes: 2

Cena

Reputation: 3419

You can use a left_anti join:

df2.join(df1, on=["id", "city", "country"], how="left_anti").show()

+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+

If all columns have non-null values, you can join on every column:

df2.join(df1, on=df2.schema.names, how="left_anti").show()

Upvotes: 5
