Nicholas Tree
Nicholas Tree

Reputation: 55

None/== vs Null/isNull in Pyspark?

What is the difference between:

F.when(F.col('Name').isNull())

and:

F.when(F.col('Name') == None)

They don't appear to work the same. The first one seems to work better when checking for null values in a column. Sometimes the second method doesn't work for checking null Names. For example if I wanted to check null values and replace the Names that are null to "Missing name" or something, the second method won't do anything sometimes. I'm not sure why it works some times and not other times. But the first method always works.

Upvotes: 3

Views: 13354

Answers (3)

Shantanu Kher
Shantanu Kher

Reputation: 1054

In data world, two Null values (or for the matter two None) are not identical.

Therefore, if you perform == or != operation with two None values, it always results in False. That is the key reason isNull() or isNotNull() functions are built for.

Please take a look at below example for better understanding -

Creating a dataframe with few valid records and one record with None

from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import col

lst = [(1,'sometext'),(2,''),(3, None),(4, 'someothertext')]

myrdd = sc.parallelize(lst).map(lambda x: Row(id=x[0], txt=x[1]))
mydf= sqlContext.createDataFrame(myrdd)

isNull() returns True for Row#3, thus below statement returns one row -

mydf.filter(col("txt").isNull()).show(truncate=False)
+---+----+
|id |txt |
+---+----+
|3  |null|
+---+----+

== operator returns False for Row#3, thus no records are filtered out.

mydf.filter(col("txt") == None).show(truncate=False)
+---+---+
|id |txt|
+---+---+
+---+---+

Upvotes: 1

Raghu
Raghu

Reputation: 1712

Refer here : Filter Pyspark dataframe column with None value

Equality based comparisons with NULL won't work because in SQL NULL is undefined so any attempt to compare it with another value returns NULL

Upvotes: 2

Harben
Harben

Reputation: 1856

F.when(F.col('Name') == None)

None is an object in python, NoneType. It's not the same as an empty string or 0. For this instance, you would want to use

F.when(F.col('Name') is None)

Using == you're checking to see if F.col('Name') value equals the None object which is going to throw things off.

Upvotes: 2

Related Questions