Reputation: 637
I have two PySpark DataFrames like below -
df1
id  city     country    region  continent
1   chicago  USA        NA      NA
2   houston  USA        NA      NA
3   Sydney   Australia  AU      AU
4   London   UK         EU      EU
df2
id  city     country  region  continent
1   chicago  USA      NA      NA
2   houston  USA      NA      NA
3   Paris    France   EU      EU
5   London   UK       EU      EU
I want to find the rows which exist in df2 but not in df1, based on all column values. So df2 - df1 should produce df_result like below:
df_result
id  city    country  region  continent
3   Paris   France   EU      EU
5   London  UK       EU      EU
How can I achieve this in PySpark? Thanks in advance.
Upvotes: 3
Views: 2395
Reputation: 2003
One more easy solution would be to use the exceptAll()
function. The docs say:
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL. Also as standard in SQL, this function resolves columns by position (not by name)
df_a = spark.createDataFrame(
    [(1, "chicago", "USA", "NA", "NA"),
     (2, "houston", "USA", "NA", "NA"),
     (3, "Sydney", "Australia", "AU", "AU"),
     (4, "London", "UK", "EU", "EU")],
    ["id", "city", "country", "region", "continent"])
df_a.show(truncate=False)
df_b = spark.createDataFrame(
    [(1, "chicago", "USA", "NA", "NA"),
     (2, "houston", "USA", "NA", "NA"),
     (3, "Paris", "France", "EU", "EU"),
     (5, "London", "UK", "EU", "EU")],
    ["id", "city", "country", "region", "continent"])
df_b.show(truncate=False)
df_a
+---+-------+---------+------+---------+
|id |city |country |region|continent|
+---+-------+---------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Sydney |Australia|AU |AU |
|4 |London |UK |EU |EU |
+---+-------+---------+------+---------+
df_b
+---+-------+-------+------+---------+
|id |city |country|region|continent|
+---+-------+-------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Paris |France |EU |EU |
|5 |London |UK |EU |EU |
+---+-------+-------+------+---------+
df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+
Upvotes: 2
Reputation: 3419
You can use a left_anti join:
df2.join(df1, on = ["id", "city", "country"], how = "left_anti").show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+
If all columns have non-null values:
df2.join(df1, on = df2.schema.names, how = "left_anti").show()
Upvotes: 5