Reputation: 2646
Is this expected behaviour? I thought about raising an issue with Spark, but this seems like such basic functionality that it's hard to imagine there's a bug here. What am I missing?
Python
>>> import numpy as np
>>> np.nan < 0.0
False
>>> np.nan > 0.0
False
PySpark
import numpy as np
from pyspark.sql.functions import col
df = spark.createDataFrame([(np.nan, 0.0), (0.0, np.nan)])
df.show()
#+---+---+
#| _1| _2|
#+---+---+
#|NaN|0.0|
#|0.0|NaN|
#+---+---+
df.printSchema()
#root
# |-- _1: double (nullable = true)
# |-- _2: double (nullable = true)
df.select(col("_1") > col("_2")).show()
#+---------+
#|(_1 > _2)|
#+---------+
#| true|
#| false|
#+---------+
Upvotes: 8
Views: 3634
Reputation: 4151
That is both expected and documented behavior. To quote the NaN Semantics section of the official Spark SQL Guide:
There is special handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:
- NaN = NaN returns true.
- In aggregations, all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
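For example, the last point can be seen directly with the df from the question; a minimal sketch (an ascending sort puts NaN after 0.0):
df.orderBy("_1").show()
#+---+---+
#| _1| _2|
#+---+---+
#|0.0|NaN|
#|NaN|0.0|
#+---+---+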
As you can see, ordering behavior is not the only difference compared to Python NaN. In particular, Spark considers NaNs equal:
spark.sql("""
WITH table AS (SELECT CAST('NaN' AS float) AS x, cast('NaN' AS float) AS y)
SELECT x = y, x != y FROM table
""").show()
+-------+-------------+
|(x = y)|(NOT (x = y))|
+-------+-------------+
| true| false|
+-------+-------------+
while plain Python
float("NaN") == float("NaN"), float("NaN") != float("NaN")
(False, True)
and NumPy
np.nan == np.nan, np.nan != np.nan
(False, True)
don't.
You can check the eqNullSafe docstring for additional examples.
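For instance, a minimal sketch along the lines of that docstring (df1 and the value column are just example names): eqNullSafe treats NaN as equal to NaN (and NULL as equal to NULL):
df1 = spark.createDataFrame([(float("NaN"),), (42.0,), (None,)], ["value"])
df1.select(df1["value"].eqNullSafe(float("NaN"))).show()
# NaN  -> true
# 42.0 -> false
# null -> false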
So to get the desired result you'll have to explicitly check for NaNs:
from pyspark.sql.functions import col, isnan, when
when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2"))
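Applied to the df from the question (the alias below is just for readability), both rows come out false:
df.select(
    when(isnan("_1") | isnan("_2"), False)
    .otherwise(col("_1") > col("_2"))
    .alias("_1 > _2")
).show()
#+-------+
#|_1 > _2|
#+-------+
#|  false|
#|  false|
#+-------+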
Upvotes: 9