lilloraffa

Reputation: 1397

Pyspark SparkSQL: issue with outer join

I'm using pyspark and I have an issue with the outer join. Basically, if I use a list of column names as the 'ON' condition, the result of the join is an inner join, regardless of whether I specify the 'left_outer' option. The problem disappears if, instead of the list of column names, I specify the full equality condition (i.e. df1.id == df2.id).

In other words:

testDf = sc.parallelize([['a', 1], ['b', 1]]).toDF(['id', 'val1'])
testDf2 = sc.parallelize([['a', 2]]).toDF(['id', 'val2'])
cond = [testDf.id == testDf2.id]
testDf.join(testDf2, cond, how='left_outer').collect()

will return the correct left outer join:

[Row(id=u'a', val1=1, id=u'a', val2=2), Row(id=u'b', val1=1, id=None, val2=None)]

but if I use

testDf.join(testDf2, 'id', how='left_outer').collect()

it will return an inner join:

[Row(id=u'a', val1=1, val2=2)]

Can you help me understand why? Thanks a lot!

Upvotes: 2

Views: 3324

Answers (1)

zero323

Reputation: 330063

As stated in the official documentation:

If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join.
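
In other words, with the column-name form the how argument has no effect in this version. To keep the left outer semantics you can stick with the explicit equality condition and then project away the duplicate key column. A minimal sketch, reusing the DataFrames from the question:

# Spell out the join condition, then select only the columns we need,
# which drops the duplicate 'id' column coming from testDf2.
cond = [testDf.id == testDf2.id]
joined = testDf.join(testDf2, cond, how='left_outer') \
               .select(testDf.id, testDf.val1, testDf2.val2)
joined.collect()
# e.g. [Row(id=u'a', val1=1, val2=2), Row(id=u'b', val1=1, val2=None)]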

Upvotes: 3
