Reputation: 637
I have two pyspark dataframes - Dataframe A
+----+---+
|name| id|
+----+---+
| a| 3|
| b| 5|
| c| 7|
+----+---+
Dataframe B
+----+---+
|name| id|
+----+---+
| a| 3|
| b| 10|
| c| 13|
+----+---+
I want to subtract dataframe B from Dataframe A based on column id. So the result dataframe should be -
+----+---+
|name| id|
+----+---+
| b| 5|
| c| 7|
+----+---+
This is my code,
common = A.join(B, ['id'], 'leftsemi')
diff = A.subtract(common)
diff.show()
But it does not give expected result. Is there a simple way to achieve this which can subtract on dataframe from another based on one column value. Unable to find it. Thanks in advance
Upvotes: 1
Views: 6571
Reputation: 13541
If you want to compare the id
column only, then use the left_anti
join. This will give you the result that is in df1
but not in df2
by comparing the id
column.
df1.join(df2, ['id'], 'left_anti').show(10, False)
+---+----+
|id |name|
+---+----+
|5 |b |
|7 |c |
+---+----+
subtract
is all.
df1.subtract(df2).show(10, False)
+----+---+
|name|id |
+----+---+
|b |5 |
|c |7 |
+----+---+
Upvotes: 3
Reputation: 31
A.subtract(B).orderBy("id").show()
+----+---+
|name| id|
+----+---+
| b| 5|
| c| 7|
+----+---+
Upvotes: 3