cs95

Reputation: 402603

Weird null checking behaviour by pd.notnull

This is essentially a rehashing of the content of my answer here.

I came across some weird behaviour when trying to solve this question, using pd.notnull.

Consider

x = ('A4', np.nan)

I want to check which of these items are null. Using np.isnan directly will throw a TypeError (but I've figured out how to solve that).

pd.notnull does not give the expected result on the tuple either:

>>> pd.notnull(x)
True

It treats the tuple as a single value (rather than an iterable of values). Furthermore, converting this to a list and then testing also gives an incorrect answer.

>>> pd.notnull(list(x))
array([ True,  True])

Since the second value is nan, the result I'm looking for should be [True, False]. It finally works when you pre-convert to a Series:

>>> pd.Series(x).notnull() 
0     True
1    False
dtype: bool

So, the solution is to Series-ify it and then test the values.

Along similar lines, another (admittedly roundabout) solution is to pre-convert to an object dtype numpy array, after which pd.notnull works directly (np.isnan still raises a TypeError on object arrays):

>>> pd.notnull(np.array(x, dtype=object))
array([ True, False])
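
A minimal sketch of this workaround, assuming the input is a flat sequence such as the tuple above. The helper name notnull_elementwise is hypothetical and not part of pandas:

```python
import numpy as np
import pandas as pd


def notnull_elementwise(values):
    """Return a boolean array marking which items of `values` are not null.

    Hypothetical helper: forcing dtype=object keeps np.nan as a float
    instead of letting numpy coerce it to the string 'nan'.
    """
    arr = np.array(list(values), dtype=object)
    return pd.notnull(arr)


x = ('A4', np.nan)
mask = notnull_elementwise(x)
print(mask)
```

Because the conversion to an object array happens before pd.notnull is called, this should behave the same way regardless of how the installed pandas version converts lists internally.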

I imagine that pd.notnull converts x to a string array under the covers, rendering the NaN as the string "nan", so it is no longer a "null" value.

Is that what pd.notnull is doing here? Or is there something else going on under the covers that I should be aware of?

Notes

In [156]: pd.__version__
Out[156]: '0.22.0'

Upvotes: 3

Views: 279

Answers (1)

Grigoriy Mikhalkin

Reputation: 5573

Here is the issue related to this behavior: https://github.com/pandas-dev/pandas/issues/20675.

In short, if the argument passed to notnull is a list, it is internally converted to a numpy array with np.asarray. The bug occurred because, when no dtype is specified, numpy converts np.nan to the string 'nan' (which pd.isnull does not recognize as a null value):

a = ['A4', np.nan]
np.asarray(a)
# array(['A4', 'nan'], dtype='<U3')

This problem was fixed in version 0.23.0, by calling np.asarray with dtype=object.
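
To see the difference the dtype makes, the two conversions can be compared side by side. This is only an illustrative sketch: the variable names as_str and as_obj are made up here, and the exact string width in the dtype (for example '<U3') may differ between numpy versions.

```python
import numpy as np
import pandas as pd

a = ['A4', np.nan]

# Without a dtype, numpy picks a unicode string dtype and stringifies np.nan.
as_str = np.asarray(a)
# With dtype=object, each element keeps its original Python type.
as_obj = np.asarray(a, dtype=object)

print(as_str.dtype.kind)   # 'U' means a unicode string array
print(repr(as_str[1]))     # the NaN has become a plain string
print(pd.notnull(as_str))  # every element is reported as non-null
print(pd.notnull(as_obj))  # the NaN is detected as null
```

Passing an explicit object array to pd.notnull sidesteps the implicit conversion entirely, which is why the workaround in the question gives the right answer on 0.22.0 as well.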

Upvotes: 3
