Reputation: 4353
I am absolutely confused by the join terminology in PySpark. I will list how I understand two of these joins; I just want to know if my understanding is correct, since I find the documentation more confusing than helpful.
We start with two dataframes: dfA and dfB.
dfA.join(dfB, 'user', 'inner') means join just the rows where dfA and dfB have common elements on the user column (the intersection of A and B on the user column).
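For instance, with some made-up data (the column names and values are just for illustration), I would expect something like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up data; 'user' is the common join column
dfA = spark.createDataFrame([('alice', 1), ('bob', 2), ('carol', 3)], ['user', 'a_val'])
dfB = spark.createDataFrame([('bob', 10), ('dave', 20)], ['user', 'b_val'])

# inner join: only rows whose user value appears in both dataframes
dfA.join(dfB, 'user', 'inner').show()
# +----+-----+-----+
# |user|a_val|b_val|
# +----+-----+-----+
# | bob|    2|   10|
# +----+-----+-----+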
dfA.join(dfB, 'user', 'leftanti') means construct a dataframe with the rows of dfA whose user value is NOT in dfB.
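And with the same made-up dataframes I would expect the anti join to keep only dfA's columns:

# left anti join: rows of dfA whose user value does NOT appear in dfB
# (row order in the output may vary)
dfA.join(dfB, 'user', 'leftanti').show()
# +-----+-----+
# | user|a_val|
# +-----+-----+
# |alice|    1|
# |carol|    3|
# +-----+-----+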
Are these two correct?
Upvotes: 1
Views: 4317
Reputation: 30645
inner join => if dfB has no matching record for a row of dfA, that row is not returned; only matched records come back.
left anti => closest to NOT EXISTS
-- NOT EXISTS
SELECT *
FROM dfA
WHERE NOT EXISTS (
    SELECT 1
    FROM dfB
    WHERE dfA.DeptName = dfB.DeptName
)
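A PySpark sketch of the same NOT EXISTS pattern (hypothetical DeptName data, just to show the shape of the call):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([('Sales',), ('IT',), ('HR',)], ['DeptName'])
dfB = spark.createDataFrame([('IT',)], ['DeptName'])

# left anti join: dfA rows whose DeptName has no match in dfB (here: Sales, HR)
dfA.join(dfB, 'DeptName', 'leftanti').show()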
It is also equivalent to a LEFT OUTER JOIN that filters out the matched rows:
SELECT *
FROM dfA
LEFT OUTER JOIN dfB
    ON dfA.DeptName = dfB.DeptName
WHERE dfB.DeptName IS NULL;
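The left outer variant can be sketched in PySpark as well (reusing the hypothetical dfA/dfB above; dfA['*'] keeps only dfA's columns after the join):

# left outer join keeps every dfA row; unmatched rows get null in dfB's columns,
# so filtering on dfB.DeptName being null leaves exactly the non-matched dfA rows
result = (dfA.join(dfB, dfA.DeptName == dfB.DeptName, 'left_outer')
             .where(dfB.DeptName.isNull())
             .select(dfA['*']))
result.show()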
Upvotes: 1