John Doe

Reputation: 10203

Incomprehensible result of a comparison between a string and null value in PySpark

Can anyone explain to me the behavior of a comparison between a string and a null value.

DataFrame:

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ["text_1", "text_1"],
    ["text_1", "text_2"],
    ["text_1", None]
]).toDF("col1", "col2")


df1 = df.withColumn("equal", F.when(F.col("col1") == F.col("col2"), "equal").otherwise("not equal"))
+-------+-------+----------+
|col1   |col2   |equal     |
+-------+-------+----------+
|text_1 |text_1 |equal     |
|text_1 |text_2 |not equal |
|text_1 |null   |not equal |  <*
+-------+-------+----------+


df2 = df.withColumn("equal", F.when(F.col("col1") != F.col("col2"), "not equal").otherwise("equal"))

+-------+-------+----------+
|col1   |col2   |equal     |
+-------+-------+----------+
|text_1 |text_1 |equal     |
|text_1 |text_2 |not equal |
|text_1 |null   |equal     |   <*
+-------+-------+----------+

The equality comparison behaves as expected, but the inequality comparison does not.

Can anyone explain this, and how can I solve it without checking .isNotNull or filling the null values with an empty string (if possible)?

Upvotes: 3

Views: 3061

Answers (2)

blackbishop

Reputation: 32660

NULL is not a value; it represents the absence of one. Comparing it to a value, or to another null, yields null rather than true or false, and the if and when functions interpret null as false. For a null-safe equality check, you can use the SQL expression is distinct from:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "equal",
    F.when(
        F.expr("col1 is distinct from col2"), "not equal"
    ).otherwise("equal")
)

df1.show()

# +------+------+---------+
# |  col1|  col2|    equal|
# +------+------+---------+
# |text_1|text_1|    equal|
# |text_1|text_2|not equal|
# |text_1|  null|not equal|
# +------+------+---------+

Or using <=> operator:

df1 = df.withColumn(
    "equal",
    F.when(
        F.expr("col1 <=> col2"), "equal"
    ).otherwise("not equal")
)

Upvotes: 3

mck

Reputation: 42352

The reason you got equal for the comparison with null is that text_1 != null evaluates to null, which the when statement interprets as false, so the otherwise branch produced the unexpected equal.

You can use eqNullSafe, which returns False instead of null when one of the columns is null. If you want to compare inequality, use the negation ~ of eqNullSafe.

import pyspark.sql.functions as F

df3 = df.withColumn("equal",
    F.when(~F.col("col1").eqNullSafe(F.col("col2")), "not equal")
     .otherwise("equal")
)

df3.show()
+------+------+---------+
|  col1|  col2|    equal|
+------+------+---------+
|text_1|text_1|    equal|
|text_1|text_2|not equal|
|text_1|  null|not equal|
+------+------+---------+

If you want to fill the null values with an empty string, you can use coalesce:

import pyspark.sql.functions as F

df4 = df.withColumn("equal",
    F.when(F.col("col1") != F.coalesce(F.col("col2"), F.lit("")), "not equal")
     .otherwise("equal")
)

df4.show()
+------+------+---------+
|  col1|  col2|    equal|
+------+------+---------+
|text_1|text_1|    equal|
|text_1|text_2|not equal|
|text_1|  null|not equal|
+------+------+---------+

Upvotes: 2
