Reputation: 43
I have two datasets, df1 and df2, in PySpark. I want to join them and find out how many rows in df1 don't match any row in df2.
I tried this code:
join = df1.join(df2, df1.studyid != df2.studyid, how='inner')
But this code does not work properly.
Please help me solve this problem. For more info, ping me in chat.
Thanks
Upvotes: 1
Views: 1439
Reputation: 2946
Use leftanti:
join = df1.join(df2, df1.studyid == df2.studyid, how='leftanti')
An anti join returns the rows from the left relation that have no match in the right relation. It is also referred to as a left anti join. To get the number of unmatched rows, call join.count() on the result.
More information: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-join.html
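To see why the inner join on != gives the wrong answer, here is a small plain-Python sketch of the two join semantics. It does not use Spark, and the studyid values are made up for illustration; it only models which rows each join would keep.

```python
# Plain-Python model of the two joins (not Spark); the studyid values are hypothetical.
df1_ids = [1, 2, 3, 4]
df2_ids = [3, 4, 5]

# Inner join on df1.studyid != df2.studyid:
# one output row for every (df1, df2) pair whose ids differ.
neq_pairs = [(a, b) for a in df1_ids for b in df2_ids if a != b]

# Left anti join on df1.studyid == df2.studyid:
# keep each df1 row that has no equal studyid anywhere in df2.
df2_keys = set(df2_ids)
unmatched = [a for a in df1_ids if a not in df2_keys]

print(len(neq_pairs))  # 10
print(unmatched)       # [1, 2]
print(len(unmatched))  # 2
```

The != join pairs almost every row with almost every other row, so even ids 3 and 4, which do exist in df2, show up in its output. The anti join keeps only ids 1 and 2, which is what the question asks for.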
Upvotes: 4