Reputation: 551
I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:
dfA:
# +----+----+
# |col1|col2|
# +----+----+
# | a|null|
# | b| 0|
# | c| 0|
# +----+----+
dfB:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x|
# | b| 0| x|
# +----+----+----+
The dataframes are creatable with this script:
dfA = spark.createDataFrame(
[
('a', None),
('b', '0'),
('c', '0')
],
('col1', 'col2')
)
dfB = spark.createDataFrame(
[
('a', None, 'x'),
('b', '0', 'x')
],
('col1', 'col2', 'col3')
)
Join call:
dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()
Result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null|null| <- col3 should be x
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
Expected result:
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | a|null| x| <-
# | b| 0| x|
# | c| 0|null|
# +----+----+----+
It works if I set the first row, col2 to anything other than null, but I need to support null values.
I tried using a condition to compare using null-safe equals as outlined in this post like so:
cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
Result of null-safe join:
# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# | a|null| a|null| x|
# | b| 0| b| 0| x|
# | c| 0|null|null|null|
# +----+----+----+----+----+
This retains duplicate columns though, I'm still looking for a way to achieve the expected result at the end of a join.
Upvotes: 8
Views: 7660
Reputation: 1
If you have to join with null values with null values in pyspark you should use eqnullsafe in joining conditions, then will match null to null values , spark 2.5 version after is better to use eqnullsafe while joining if need more with examples https://knowledges.co.in/how-to-use-eqnullsafe-in-pyspark-for-null-values/
Upvotes: 0
Reputation: 28322
A simple solution would be to select
the columns that you want to keep. This will let you specify which source dataframe they should come from as well as avoid the duplicate column issue.
dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()
Upvotes: 5
Reputation:
This fails, because col1
in orderBy
is ambiguous. You should reference specific source, for example dfA
:
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
Upvotes: 2