Reputation: 1641
I have 2 dataframes. 1 has more than 20 columns, other has only id
column.
I am aware that if the dataframes schema is exactly same, then we can do df1.subtract(df2)
.
I want the data that is present in df1 but not in df2. Both the frames have id field in common.
I tried df1.select('id').subtract(df2.select('id')).collect()
But this returns only id
column as output. I want all the columns from df1.
I have also tried df1.join(df2, df1.id == df2.id, how="left_anti")
, But this also doesn't seem to work.
df1:
+----+---+
|name| id|
+----+---+
| a| 3|
| b| 5|
| c| 7|
+----+---+
df2:
+---+
| id|
+---+
| 3|
| 2|
| 8|
+---+
expected output:
+----+---+
|name| id|
+----+---+
| b| 5|
| c| 7|
+----+---+
I am new to this. Thanks for any help
Upvotes: 3
Views: 386
Reputation: 5536
I am getting correct results using left anti join
df1.show()
+----+---+
|name| id|
+----+---+
| a| 3|
| b| 5|
| c| 7|
+----+---+
df2.show()
+---+
| id|
+---+
| 3|
| 2|
| 4|
+---+
df1.join(df2,'id','left_anti').show()
+---+----+
| id|name|
+---+----+
| 7| c|
| 5| b|
+---+----+
This is your expected result. Is there any other issue you are facing.
Upvotes: 3