Dataframe Join Null-Safe Condition Use

I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:

dfA:

# +----+----+
# |col1|col2|
# +----+----+
# |   a|null|
# |   b|   0|
# |   c|   0|
# +----+----+

dfB:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# +----+----+----+

The dataframes can be created with this script:

dfA = spark.createDataFrame(
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)

dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)

Join call:

dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()

Result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|null|  <- col3 should be x
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

Expected result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|  <-
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

The join works if I set col2 in the first row to anything other than null, but I need to support null values.

I tried a join condition that compares the keys with null-safe equality (eqNullSafe), as outlined in this post:

cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()

Result of null-safe join:

# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# |   a|null|   a|null|   x|
# |   b|   0|   b|   0|   x|
# |   c|   0|null|null|null|
# +----+----+----+----+----+

This matches the null row correctly, but it keeps the key columns from both dataframes, so col1 and col2 appear twice. I'm still looking for a way to get the expected result directly from the join.

