user3353185
user3353185

Reputation: 352

Erratic NaN behaviour in numpy/pandas

I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.

When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:

In[ ] X['Smokinginpregnancy'].head() 

Out[ ] 

Index
E09000002          NaN
E09000003     5.216126
E09000004    10.287496
E09000005     3.090379
E09000006     6.080041
Name: Smokinginpregnancy, dtype: float64

I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:

In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan

However, when I test whether it's NaN, python returns False.

In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False

I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.

Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)

I'm using Anaconda's Python 3 distribution.

Upvotes: 2

Views: 229

Answers (2)

jezrael
jezrael

Reputation: 863741

NaN in pandas can be checked with the function pandas.isnull. I created boolean mask and return subset with NaN values.

Function fillna can be used for one column Smokinginpregnancy (more info in doc):

X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')

or

X['Smokinginpregnancy'].fillna('100', inplace=True)

Warning:
Sometimes inplace=True can be ignored, better is not use. - link, github, github 3 comments.

All together:

print X['Smokinginpregnancy'].head() 
#Index
#E09000002          NaN
#E09000003     5.216126
#E09000004    10.287496
#E09000005     3.090379
#E09000006     6.080041

#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print XNaN
#           Smokinginpregnancy
#Index
#E09000002                 NaN

#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print X
#          Smokinginpregnancy
#Index
#E09000002                100
#E09000003           5.216126
#E09000004            10.2875
#E09000005           3.090379
#E09000006           6.080041

More information, why comparison doesn't work:

One has to be mindful that in python (and numpy), the nan's can’t compared for equality, but None's can. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info in Bakuriu's answer.

In [11]: None == None
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False

Upvotes: 1

Bakuriu
Bakuriu

Reputation: 102039

You are doing:

X['Smokinginpregnancy'][0] == np.nan

This is guaranteed to return False because all NaNs compare unequal to everything by IEEE754 standard:

>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False

See also here. You have to use math.isnan to check for NaNs:

>>> math.isnan(x)
True

Or numpy.isnan

So use:

numpy.isnan(X['Smokinginpregnancy'][0])

Regarding pandas.fillna note that this function returns the filled array. Maybe you did something like:

X.fillna(...)

without reassigning X? Alternatively you must pass inplace=True to mutate the dataframe on which you are calling the method.

Upvotes: 3

Related Questions