Qubix
Qubix

Reputation: 4353

What is the difference between INNER JOIN and ANTI JOIN

I am absolutely confused by the join terminology in PySpark. I will list how I understand two of these joins, I just want to know if my understanding is correct, since I find the documentation more confusing than helpful.

We start with two dataframes: dfA and dfB.

dfA.join(dfB, 'user', 'inner') 

means join just the rows where dfA and dfB have common elements on the user column. (intersection of A and B on the user column).

dfA.join(dfB, 'user', 'leftanti') 

means construct a dataframe with elements in dfA THAT ARE NOT in dfB.

Are these two correct?

Upvotes: 1

Views: 4317

Answers (1)

Derviş Kayımbaşıoğlu
Derviş Kayımbaşıoğlu

Reputation: 30645

inner join => if DFB does not have record for DFA then it does not return non-matched records.

left anti => most close to Not Exists

-- NOT EXISTS
SELECT *
FROM dfA
WHERE NOT EXISTS (
  SELECT 1
  FROM dfB
  WHERE dfA.DeptName = dfB.DeptName
)

also LEFT OUTER JOIN equivalent

SELECT * 
FROM dfA
LEFT OUTER JOIN DFB
  ON dfA.DeptName = dfB.DeptName
where dfB.DeptName IS NULL;

Upvotes: 1

Related Questions