Reputation: 352
I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna
method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.
When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:
In[ ] X['Smokinginpregnancy'].head()
Out[ ]
Index
E09000002 NaN
E09000003 5.216126
E09000004 10.287496
E09000005 3.090379
E09000006 6.080041
Name: Smokinginpregnancy, dtype: float64
I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:
In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan
However, when I test whether it's NaN, python returns False.
In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False
I suspect that when .fillna
is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.
Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)
I'm using Anaconda's Python 3 distribution.
Upvotes: 2
Views: 229
Reputation: 863741
NaN
in pandas can be checked with the function pandas.isnull. I created boolean mask and return subset with NaN
values.
Function fillna can be used for one column Smokinginpregnancy
(more info in doc):
X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
or
X['Smokinginpregnancy'].fillna('100', inplace=True)
Warning:
Sometimes inplace=True
can be ignored, better is not use. - link, github, github 3 comments.
All together:
print X['Smokinginpregnancy'].head()
#Index
#E09000002 NaN
#E09000003 5.216126
#E09000004 10.287496
#E09000005 3.090379
#E09000006 6.080041
#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print XNaN
# Smokinginpregnancy
#Index
#E09000002 NaN
#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print X
# Smokinginpregnancy
#Index
#E09000002 100
#E09000003 5.216126
#E09000004 10.2875
#E09000005 3.090379
#E09000006 6.080041
More information, why comparison doesn't work:
One has to be mindful that in python (and numpy), the nan's can’t compared for equality, but None's can. Note that Pandas/numpy uses the fact that np.nan != np.nan
, and treats None
like np.nan
. More info in Bakuriu's
answer.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
Upvotes: 1
Reputation: 102039
You are doing:
X['Smokinginpregnancy'][0] == np.nan
This is guaranteed to return False
because all NaNs compare unequal to everything by IEEE754 standard:
>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False
See also here.
You have to use math.isnan
to check for NaNs:
>>> math.isnan(x)
True
Or numpy.isnan
So use:
numpy.isnan(X['Smokinginpregnancy'][0])
Regarding pandas.fillna
note that this function returns the filled array. Maybe you did something like:
X.fillna(...)
without reassigning X
? Alternatively you must pass inplace=True
to mutate the dataframe on which you are calling the method.
Upvotes: 3