Tibberzz

Reputation: 551

Dataframe Join Null-Safe Condition Use

I have two dataframes with null values that I'm trying to join using PySpark 2.3.0:

dfA:

# +----+----+
# |col1|col2|
# +----+----+
# |   a|null|
# |   b|   0|
# |   c|   0|
# +----+----+

dfB:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# +----+----+----+

The dataframes can be created with this script:

dfA = spark.createDataFrame(
    [
        ('a', None),
        ('b', '0'),
        ('c', '0')
    ],
    ('col1', 'col2')
)

dfB = spark.createDataFrame(
    [
        ('a', None, 'x'),
        ('b', '0', 'x')
    ],
    ('col1', 'col2', 'col3')
)

Join call:

dfA.join(dfB, dfB.columns[:2], how='left').orderBy('col1').show()

Result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|null|  <- col3 should be x
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

Expected result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|  <-
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+

It works if I set col2 in the first row to anything other than null, but I need to support null values.

I tried a join condition using null-safe equality (eqNullSafe), as outlined in this post, like so:

cond = (dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2))
dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()

Result of null-safe join:

# +----+----+----+----+----+
# |col1|col2|col1|col2|col3|
# +----+----+----+----+----+
# |   a|null|   a|null|   x|
# |   b|   0|   b|   0|   x|
# |   c|   0|null|null|null|
# +----+----+----+----+----+

This retains the duplicate join columns, though. I'm still looking for a way to achieve the expected result at the end of the join.

Upvotes: 8

Views: 7660

Answers (3)

Sid jain

Reputation: 1

If you have to join on columns that contain null values in PySpark, use eqNullSafe in the join condition so that null values match each other (eqNullSafe is available from Spark 2.3 onward). For more examples, see https://knowledges.co.in/how-to-use-eqnullsafe-in-pyspark-for-null-values/
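
For illustration, here is a minimal sketch (not from the linked post) of how eqNullSafe differs from plain equality on nulls, assuming the dfA defined in the question:

# Compare a nullable column against a null literal with == vs. eqNullSafe.
from pyspark.sql import functions as F

dfA.select(
    dfA.col2,
    (dfA.col2 == F.lit(None)).alias('plain_eq'),        # null whenever either side is null
    dfA.col2.eqNullSafe(F.lit(None)).alias('null_safe')  # true only where col2 is null
).show()

# Expected output (roughly):
# +----+--------+---------+
# |col2|plain_eq|null_safe|
# +----+--------+---------+
# |null|    null|     true|
# |   0|    null|    false|
# |   0|    null|    false|
# +----+--------+---------+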

Upvotes: 0

Shaido

Reputation: 28322

A simple solution would be to select the columns that you want to keep. This will let you specify which source dataframe they should come from as well as avoid the duplicate column issue.

dfA.join(dfB, cond, how='left').select(dfA.col1, dfA.col2, dfB.col3).orderBy('col1').show()
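
With the dataframes and cond from the question, this select should produce the expected result:

# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |   a|null|   x|
# |   b|   0|   x|
# |   c|   0|null|
# +----+----+----+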

Upvotes: 5

user9938688

Reputation:

This fails because col1 in orderBy is ambiguous. You should reference a specific source dataframe, for example dfA:

dfA.join(dfB, cond, how='left').orderBy(dfA.col1).show()
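
Combined with the column selection from the answer above, a full sketch (assuming the cond from the question) would be:

cond = dfA.col1.eqNullSafe(dfB.col1) & dfA.col2.eqNullSafe(dfB.col2)
dfA.join(dfB, cond, how='left') \
    .select(dfA.col1, dfA.col2, dfB.col3) \
    .orderBy(dfA.col1) \
    .show()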

Upvotes: 2
