Pyspark : Subtract one dataframe from another based on one column value

Question

I have two pyspark dataframes - Dataframe A

+----+---+
|name| id|
+----+---+
|   a|  3|
|   b|  5|
|   c|  7|
+----+---+

Dataframe B

+----+---+
|name| id|
+----+---+
|   a|  3|
|   b| 10|
|   c| 13|
+----+---+

I want to subtract dataframe B from Dataframe A based on column id. So the result dataframe should be -

+----+---+
|name| id|
+----+---+
|   b|  5|
|   c|  7|
+----+---+

This is my code,

common = A.join(B, ['id'], 'leftsemi')
diff = A.subtract(common)
diff.show()

But it does not give the expected result. Is there a simple way to subtract one dataframe from another based on a single column's value? I was unable to find one. Thanks in advance.

Lamanus · Accepted Answer

If you want to compare only the id column, use a left_anti join. It returns the rows of df1 whose id does not appear in df2.

df1.join(df2, ['id'], 'left_anti').show(10, False)

+---+----+
|id |name|
+---+----+
|5  |b   |
|7  |c   |
+---+----+

subtract, by contrast, compares all columns, so a row is removed only when the entire row also appears in df2.

df1.subtract(df2).show(10, False)

+----+---+
|name|id |
+----+---+
|b   |5  |
|c   |7  |
+----+---+
