Reputation: 1397
I'm using PySpark and I have an issue with the outer join. If I use a list of column names as the 'on' condition, the result of the join is an inner join, regardless of whether I specify the 'left_outer' option. The problem disappears if, instead of the list of column names, I specify the full equality condition (i.e. df1.id == df2.id).
In other words:
testDf = sc.parallelize([['a', 1], ['b', 1]]).toDF(['id', 'val1'])
testDf2 = sc.parallelize([['a', 2]]).toDF(['id', 'val2'])
cond = [testDf.id == testDf2.id]
testDf.join(testDf2, cond, how='left_outer').collect()
will return the right left outer join:
[Row(id=u'a', val1=1, id=u'a', val2=2), Row(id=u'b', val1=1, id=None, val2=None)]
but if I use
testDf.join(testDf2, 'id', how='left_outer').collect()
it returns an inner join instead:
[Row(id=u'a', val1=1, val2=2)]
Can you help me understand why? Thanks a lot!
Upvotes: 2
Views: 3324
Reputation: 330063
As stated in the official documentation:

"If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join."
Upvotes: 3