Reputation: 10203
Can anyone explain to me the behavior of a comparison between a string and a null value?
DataFrame:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ["text_1", "text_1"],
    ["text_1", "text_2"],
    ["text_1", None]
]).toDF("col1", "col2")

df1 = df.withColumn("equal", F.when(F.col("col1") == F.col("col2"), "equal").otherwise("not equal"))
df1.show()
+-------+-------+----------+
|col1 |col2 |equal |
+-------+-------+----------+
|text_1 |text_1 |equal |
|text_1 |text_2 |not equal |
|text_1 |null |not equal | <*
+-------+-------+----------+
df2 = df.withColumn("equal", F.when(F.col("col1") != F.col("col2"), "not equal").otherwise("equal"))
df2.show()
+-------+-------+----------+
|col1 |col2 |equal |
+-------+-------+----------+
|text_1 |text_1 |equal |
|text_1 |text_2 |not equal |
|text_1 |null |equal | <*
+-------+-------+----------+
The is equal comparison works as expected, but the is not equal comparison goes wrong. Can anyone explain this to me, and how can I solve it without checking with .isNotNull or filling the null values with an empty string (if possible)?
Upvotes: 3
Views: 3061
Reputation: 32660
NULL is not a value but represents the absence of a value. You can't compare it to a value nor to another null; the comparison evaluates to null, which is interpreted as false in if and when functions. For a null-safe equality check, you can use the SQL expression is distinct from:
from pyspark.sql import functions as F
df1 = df.withColumn(
    "equal",
    F.when(
        F.expr("col1 is distinct from col2"), "not equal"
    ).otherwise("equal")
)
df1.show()
# +------+------+---------+
# | col1| col2| equal|
# +------+------+---------+
# |text_1|text_1| equal|
# |text_1|text_2|not equal|
# |text_1| null|not equal|
# +------+------+---------+
Or using the <=> operator:
df1 = df.withColumn(
    "equal",
    F.when(
        F.expr("col1 <=> col2"), "equal"
    ).otherwise("not equal")
)
Upvotes: 3
Reputation: 42352
The reason why you got equal for the comparison with null is that text_1 != null gives null, which is interpreted as false by the when statement, so you got the unexpected equal from the otherwise clause.
You can use eqNullSafe, which returns False instead of null when one of the columns is null. If you want to compare inequality, use the negation ~ of eqNullSafe.
import pyspark.sql.functions as F
df3 = df.withColumn(
    "equal",
    F.when(~F.col("col1").eqNullSafe(F.col("col2")), "not equal")
     .otherwise("equal")
)
df3.show()
+------+------+---------+
| col1| col2| equal|
+------+------+---------+
|text_1|text_1| equal|
|text_1|text_2|not equal|
|text_1| null|not equal|
+------+------+---------+
If you want to fill the null values with an empty string, you can use coalesce:
import pyspark.sql.functions as F
df4 = df.withColumn(
    "equal",
    F.when(F.col("col1") != F.coalesce(F.col("col2"), F.lit("")), "not equal")
     .otherwise("equal")
)
df4.show()
+------+------+---------+
| col1| col2| equal|
+------+------+---------+
|text_1|text_1| equal|
|text_1|text_2|not equal|
|text_1| null|not equal|
+------+------+---------+
Upvotes: 2