Reputation: 7057
I have a Pandas Dataframe as below:
itm Date Amount
67 420 2012-09-30 00:00:00 65211
68 421 2012-09-09 00:00:00 29424
69 421 2012-09-16 00:00:00 29877
70 421 2012-09-23 00:00:00 30990
71 421 2012-09-30 00:00:00 61303
72 485 2012-09-09 00:00:00 71781
73 485 2012-09-16 00:00:00 NaN
74 485 2012-09-23 00:00:00 11072
75 485 2012-09-30 00:00:00 113702
76 489 2012-09-09 00:00:00 64731
77 489 2012-09-16 00:00:00 NaN
When I try to apply a function to the Amount column, I get the following error:
ValueError: cannot convert float NaN to integer
I have tried applying a function using math.isnan, pandas' .replace method, the .sparse data attribute from pandas 0.9, and an if NaN == NaN statement in a function; I have also looked at this Q/A; none of them works.
How do I do it?
Upvotes: 673
Views: 1387407
Reputation: 23351
If you're reading data with missing values from a file using read_csv etc., you can pass keep_default_na=False to read missing values as empty strings (""). In specific cases this is useful because it achieves what fillna or replace does in one function call (with one less copy in memory).
import numpy as np
import pandas as pd

df = pd.read_csv(filepath, keep_default_na=False)
# the above is the same as
df = pd.read_csv(filepath).fillna("")
# or
df = pd.read_csv(filepath).replace(np.nan, "")
If the dataframe contains numbers, then you can pass dtypes to read_csv
to construct a dataframe with the desired dtype columns.
df = pd.read_csv(filepath, keep_default_na=False, dtype={"col1": "Int64", "col2": "string", "col3": "Float64"})
Another way to replace NaN is via the mask() / where() methods. They are similar: mask replaces values that satisfy the condition, whereas where replaces values that do not satisfy the condition. So we just have to select the NaN values and replace them with the desired value.
import pandas as pd
df = pd.DataFrame({'a': [1, float('nan'), float('nan')], 'b': [float('nan'), 'a', 'b']})
df = df.where(df.notna(), 10) # for the entire dataframe
df['a'] = df['a'].where(df['a'].notna(), 10) # for a single column
The advantage of this method is that we can conditionally replace NaN values. The following is an example where NaN values in df are replaced by 10 if the condition cond is satisfied.
cond = pd.DataFrame({'a': [True, True, False], 'b':[False, True, True]})
df = df.mask(df.isna() & cond, 10)
Under the hood, fillna() calls where() (source), which in turn calls numpy.where() if the dataframe is small and numexpr.evaluate if it's large (source). So fillna / mask / where are essentially the same method for the purposes of replacing NaN values. On the other hand, replace() (another method given on this page) is a numpy.putmask operation (source). Because numexpr is faster than numpy for large arrays, replace may be outperformed by the other methods for very large dataframes.
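If performance matters for your data, a rough way to check is to time both on a frame of comparable size (the frame below and its NaN fraction are arbitrary assumptions, not from the question):
import numpy as np
import pandas as pd
from timeit import timeit

# build a large frame with roughly 10% NaN values (size chosen arbitrarily)
rng = np.random.default_rng(0)
arr = rng.random((1_000_000, 4))
arr[rng.random(arr.shape) < 0.1] = np.nan
df_big = pd.DataFrame(arr, columns=list('abcd'))

print(timeit(lambda: df_big.fillna(0), number=10))
print(timeit(lambda: df_big.replace(np.nan, 0), number=10))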
On a tangential note, it's common for a dataframe to have the literal string 'NaN' instead of an actual NaN value. To make sure that a dataframe indeed has NaN values, check with df.isna().any(). If it returns False when it should contain NaN, then you probably have 'NaN' strings, in which case use replace to convert them into NaN or, even better, replace them directly with the value you meant to fill in. For example:
df = pd.DataFrame({'a': ['a', 'b', 'NaN']})
df = df.replace('NaN', 'c')
Upvotes: 0
Reputation: 79
If you want to fill NaN for a specific column, you can use loc:
import pandas as pd

d1 = {"Col1": ['A', 'B', 'C'],
      "fruits": ['Avocado', 'Banana', 'NaN']}
d1 = pd.DataFrame(d1)
output:
Col1 fruits
0 A Avocado
1 B Banana
2 C NaN
d1.loc[d1.Col1=='C', 'fruits'] = 'Carrot'
output:
Col1 fruits
0 A Avocado
1 B Banana
2 C Carrot
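If the missing entry is an actual NaN rather than the literal string 'NaN', the same loc pattern works with isna() (a small sketch with assumed data):
import numpy as np
import pandas as pd

d1 = pd.DataFrame({"Col1": ['A', 'B', 'C'],
                   "fruits": ['Avocado', 'Banana', np.nan]})

# select the rows where 'fruits' is missing and assign the replacement
d1.loc[d1['fruits'].isna(), 'fruits'] = 'Carrot'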
Upvotes: 2
Reputation: 719
There are primarily two options. For imputing or filling missing values (NaN / np.nan) in one or more columns with numerical replacements, fillna is sufficient:
df['Amount'] = df['Amount'].fillna(value=0)
From the documentation:
value : scalar, dict, Series, or DataFrame. Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
Which means a list of 'strings' or 'constants' cannot be passed as the fill value.
For more specialized imputations use SimpleImputer():
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(X=df[['Col-1', 'Col-2']])
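For instance, a self-contained sketch with made-up string columns (the column names and values are assumptions for illustration):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Col-1': ['a', np.nan, 'c'],
                   'Col-2': ['x', np.nan, 'z']})

# every NaN in both columns becomes the constant 'Replacement_Value'
si = SimpleImputer(strategy='constant', missing_values=np.nan,
                   fill_value='Replacement_Value')
df[['Col-1', 'Col-2']] = si.fit_transform(df[['Col-1', 'Col-2']])
print(df)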
Upvotes: 3
Reputation: 642
Considering that the particular column Amount in the above table is of integer type, the following would be a solution:
df['Amount'] = df['Amount'].fillna(0).astype(int)
Similarly, you can fill it with various data types like float, str and so on.
In particular, I would pay attention to the data type when comparing various values of the same column.
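A short sketch of filling the same kind of column with other types (the values here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Amount': [65211, np.nan, 29877]})

print(df['Amount'].fillna(0).astype(int))   # integer fill, as above
print(df['Amount'].fillna(0.0))             # float fill
print(df['Amount'].fillna('missing'))       # string fill; the column is upcast to object dtype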
Upvotes: 8
Reputation: 4642
To replace NaN values in pandas:
df['column_name'].fillna(value_to_be_replaced, inplace=True)
If inplace=False, instead of updating the df (dataframe) it will return the modified values.
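A minimal sketch of the difference (the column name and values are assumptions):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_name': [1.0, np.nan, 3.0]})

# inplace=False (the default) returns a new Series, so assign it back
df['column_name'] = df['column_name'].fillna(0)

# inplace=True would modify the Series in place and return None instead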
Upvotes: 8
Reputation: 135
To replace NaN in different columns with different values:
replacement = {'column_A': 0, 'column_B': -999, 'column_C': -99999}
df.fillna(value=replacement)
Upvotes: 6
Reputation: 4686
I just wanted to provide a special case. If you're using a multi-index or otherwise using an index-slicer, the inplace=True
option may not be enough to update the slice you've chosen. For example in a 2x2 level multi-index this will not change any values (as of pandas 0.15):
idx = pd.IndexSlice
df.loc[idx[:,mask_1], idx[mask_2,:]].fillna(value=0, inplace=True)
The "problem" is that the chaining breaks the fillna ability to update the original dataframe. I put "problem" in quotes because there are good reasons for the design decisions that led to not interpreting through these chains in certain situations. Also, this is a complex example (though I really ran into it), but the same may apply to fewer levels of indexes depending on how you slice.
The solution is DataFrame.update
:
df.update(df.loc[idx[:,mask_1], idx[mask_2,:]].fillna(value=0))
It's one line, reads reasonably well (sort of) and eliminates any unnecessary messing with intermediate variables or loops while allowing you to apply fillna to any multi-level slice you like!
If anybody can find places where this doesn't work, please post in the comments; I've been messing with it and looking at the source, and it seems to solve at least my multi-index slice problems.
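A self-contained sketch of the pattern (the index levels, masks and fill value below are made up for illustration):
import numpy as np
import pandas as pd

# 2x2-level MultiIndex on both axes, all values missing
rows = pd.MultiIndex.from_product([['g1', 'g2'], ['a', 'b']])
cols = pd.MultiIndex.from_product([['x', 'y'], ['c1', 'c2']])
df = pd.DataFrame(np.nan, index=rows, columns=cols)

idx = pd.IndexSlice
mask_1 = ['a']   # selection on the inner row level
mask_2 = ['x']   # selection on the outer column level

# fillna on the slice, then write the result back into the original frame
df.update(df.loc[idx[:, mask_1], idx[mask_2, :]].fillna(value=0))
print(df)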
Upvotes: 30
Reputation: 47219
DataFrame.fillna()
or Series.fillna()
will do this for you.
Example:
In [7]: df
Out[7]:
0 1
0 NaN NaN
1 -0.494375 0.570994
2 NaN NaN
3 1.876360 -0.229738
4 NaN NaN
In [8]: df.fillna(0)
Out[8]:
0 1
0 0.000000 0.000000
1 -0.494375 0.570994
2 0.000000 0.000000
3 1.876360 -0.229738
4 0.000000 0.000000
To fill the NaNs in only one column, select just that column.
In [12]: df[1] = df[1].fillna(0)
In [13]: df
Out[13]:
0 1
0 NaN 0.000000
1 -0.494375 0.570994
2 NaN 0.000000
3 1.876360 -0.229738
4 NaN 0.000000
Or you can use the built-in column-specific functionality:
df = df.fillna({1: 0})
Upvotes: 1016
Reputation: 663
Using a lambda expression, it is also possible to replace NaN with 0.
Below is an example:
dss3 = dss2['Score'].apply(lambda x: 0 if pd.isna(x) else x)
print(dss3)
Upvotes: -2
Reputation: 590
I think it's also worth mentioning and explaining the parameter configuration of fillna(), like method, axis, limit, etc.
From the documentation we have:
Series.fillna(value=None, method=None, axis=None,
inplace=False, limit=None, downcast=None)
Fill NA/NaN values using the specified method.
Parameters
value [scalar, dict, Series, or DataFrame] Value to use to
fill holes (e.g. 0), alternately a dict/Series/DataFrame
of values specifying which value to use for each index
(for a Series) or column (for a DataFrame). Values not in
the dict/Series/DataFrame will not be filled. This
value cannot be a list.
method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None},
default None] Method to use for filling holes in
reindexed Series. pad / ffill: propagate last valid
observation forward to next valid. backfill / bfill:
use next valid observation to fill gap.
axis [{0 or ‘index’}] Axis along which to fill missing values.
inplace [bool, default False] If True, fill
in-place. Note: this will modify any other views
on this object (e.g., a no-copy slice for a
column in a DataFrame).
limit [int,defaultNone] If method is specified,
this is the maximum number of consecutive NaN
values to forward/backward fill. In other words,
if there is a gap with more than this number of
consecutive NaNs, it will only be partially filled.
If method is not specified, this is the maximum
number of entries along the entire axis where NaNs
will be filled. Must be greater than 0 if not None.
downcast [dict, default is None] A dict of item->dtype
of what to downcast if possible, or the string ‘infer’
which will try to downcast to an appropriate equal
type (e.g. float64 to int64 if possible).
OK, let's start with the method= parameter. It supports forward fill (ffill) and backward fill (bfill). ffill copies the previous non-missing value forward.
e.g.:
import pandas as pd
import numpy as np
inp = [{'c1': 10, 'c2': np.nan, 'c3': 200},
       {'c1': np.nan, 'c2': 110, 'c3': 210},
       {'c1': 12, 'c2': np.nan, 'c3': 220},
       {'c1': 12, 'c2': 130, 'c3': np.nan},
       {'c1': 12, 'c2': np.nan, 'c3': 240}]
df = pd.DataFrame(inp)
c1 c2 c3
0 10.0 NaN 200.0
1 NaN 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 NaN
4 12.0 NaN 240.0
Forward fill:
df.fillna(method="ffill")
c1 c2 c3
0 10.0 NaN 200.0
1 10.0 110.0 210.0
2 12.0 110.0 220.0
3 12.0 130.0 220.0
4 12.0 130.0 240.0
Backward fill:
df.fillna(method="bfill")
c1 c2 c3
0 10.0 110.0 200.0
1 12.0 110.0 210.0
2 12.0 130.0 220.0
3 12.0 130.0 240.0
4 12.0 NaN 240.0
The axis parameter helps us choose the direction of the fill:
Fill directions:
ffill:
Axis = 1
Method = 'ffill'
----------->
direction
df.fillna(method="ffill", axis=1)
c1 c2 c3
0 10.0 10.0 200.0
1 NaN 110.0 210.0
2 12.0 12.0 220.0
3 12.0 130.0 130.0
4 12.0 12.0 240.0
Axis = 0 # by default
Method = 'ffill'
|
| # direction
|
V
e.g: # This is the ffill default
df.fillna(method="ffill", axis=0)
c1 c2 c3
0 10.0 NaN 200.0
1 10.0 110.0 210.0
2 12.0 110.0 220.0
3 12.0 130.0 220.0
4 12.0 130.0 240.0
bfill:
axis= 0
method = 'bfill'
^
|
|
|
df.fillna(method="bfill", axis=0)
c1 c2 c3
0 10.0 110.0 200.0
1 12.0 110.0 210.0
2 12.0 130.0 220.0
3 12.0 130.0 240.0
4 12.0 NaN 240.0
axis = 1
method = 'bfill'
<-----------
df.fillna(method="bfill", axis=1)
c1 c2 c3
0 10.0 200.0 200.0
1 110.0 110.0 210.0
2 12.0 220.0 220.0
3 12.0 130.0 NaN
4 12.0 240.0 240.0
# aliases:
# 'ffill' == 'pad'
# 'bfill' == 'backfill'
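Note that pandas also exposes these as the ffill() / bfill() methods, which take the same axis and limit arguments (in newer pandas versions fillna(method=...) is deprecated in favor of them):
df.ffill()              # same as df.fillna(method="ffill")
df.bfill(axis=1)        # same as df.fillna(method="bfill", axis=1)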
limit parameter:
df
c1 c2 c3
0 10.0 NaN 200.0
1 NaN 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 NaN
4 12.0 NaN 240.0
Only replace the first NaN element in each column:
df.fillna(value = 'Unavailable', limit=1)
c1 c2 c3
0 10.0 Unavailable 200.0
1 Unavailable 110.0 210.0
2 12.0 NaN 220.0
3 12.0 130.0 Unavailable
4 12.0 NaN 240.0
df.fillna(value = 'Unavailable', limit=2)
c1 c2 c3
0 10.0 Unavailable 200.0
1 Unavailable 110.0 210.0
2 12.0 Unavailable 220.0
3 12.0 130.0 Unavailable
4 12.0 NaN 240.0
downcast parameter:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 4 non-null float64
1 c2 2 non-null float64
2 c3 4 non-null float64
dtypes: float64(3)
memory usage: 248.0 bytes
df.fillna(method="ffill",downcast='infer').info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 c1 5 non-null int64
1 c2 4 non-null float64
2 c3 5 non-null int64
dtypes: float64(1), int64(2)
memory usage: 248.0 bytes
Upvotes: 1
Reputation: 1430
This works for me, but no one's mentioned it. Could there be something wrong with it?
df.loc[df['column_name'].isnull(), 'column_name'] = 0
Upvotes: 5
Reputation: 1235
You can also use a dictionary to fill NaN values in specific columns of the DataFrame, rather than filling the whole DataFrame with a single value.
import pandas as pd
df = pd.read_excel('example.xlsx')
df.fillna({
    'column1': 'Write your values here',
    'column2': 'Write your values here',
    'column3': 'Write your values here',
    'column4': 'Write your values here',
    .
    .
    .
    'column-n': 'Write your values here'}, inplace=True)
Upvotes: 14
Reputation: 4498
It is not guaranteed whether slicing returns a view or a copy, so instead you can do:
df['column'] = df['column'].fillna(value)
Upvotes: 199
Reputation: 1066
Easy ways to fill missing values:
Filling string columns: when string columns have missing values and NaN values, fill them with the mode (most frequent value).
df['string column name'].fillna(df['string column name'].mode().values[0], inplace=True)
Filling numeric columns: when numeric columns have missing values and NaN values, fill them with the mean.
df['numeric column name'].fillna(df['numeric column name'].mean(), inplace=True)
Filling NaN with zero:
df['column name'].fillna(0, inplace=True)
Upvotes: 11
Reputation: 31692
You could use replace
to change NaN
to 0
:
import pandas as pd
import numpy as np
# for column
df['column'] = df['column'].replace(np.nan, 0)
# for whole dataframe
df = df.replace(np.nan, 0)
# inplace
df.replace(np.nan, 0, inplace=True)
Upvotes: 70
Reputation: 551
The below code worked for me.
import pandas
df = pandas.read_csv('somefile.txt')
df = df.fillna(0)
Upvotes: 31