Ahalya Hegde
Ahalya Hegde

Reputation: 1641

Pyspark find difference between 2 dataframes of different schema

I have 2 dataframes. 1 has more than 20 columns, other has only id column. I am aware that if the dataframes schema is exactly same, then we can do df1.subtract(df2).

I want the data that is present in df1 but not in df2. Both the frames have id field in common. I tried df1.select('id').subtract(df2.select('id')).collect() But this returns only id column as output. I want all the columns from df1.

I have also tried df1.join(df2, df1.id == df2.id, how="left_anti") , But this also doesn't seem to work.

df1:

+----+---+
|name| id|
+----+---+
|   a|  3|
|   b|  5|
|   c|  7|
+----+---+

df2:

 +---+
 | id|
 +---+
 |  3|
 |  2|
 |  8|
 +---+

expected output:

+----+---+
|name| id|
+----+---+
|   b|  5|
|   c|  7|
+----+---+

I am new to this. Thanks for any help

Upvotes: 3

Views: 386

Answers (1)

Shubham Jain
Shubham Jain

Reputation: 5536

I am getting correct results using left anti join

df1.show()
+----+---+
|name| id|
+----+---+
|   a|  3|
|   b|  5|
|   c|  7|
+----+---+

df2.show()
+---+
| id|
+---+
|  3|
|  2|
|  4|
+---+

df1.join(df2,'id','left_anti').show()

+---+----+
| id|name|
+---+----+
|  7|   c|
|  5|   b|
+---+----+

This is your expected result. Is there any other issue you are facing.

Upvotes: 3

Related Questions