Reputation: 21971
In the dataframe below:
T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 -
How do I replace all the -
with NaN's. I do not want to specify column names since I do not know before hand which column will have -
Upvotes: 4
Views: 1029
Reputation: 210842
It looks like you were reading this data from CSV/FWF file... If it's true the easiest way to get rid of '-'
would be to explain Pandas that it's NaN
's representation:
df = pd.read_csv(filename, na_values=['NaN', 'nan', '-'])
Test:
In [79]: df
Out[79]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
In [80]: df.dtypes
Out[80]:
T2MN float64
T2MX float64
RH2M float64
DFP2M float64
RAIN float64
dtype: object
Upvotes: 1
Reputation: 294278
If those are strings, then your floats are probably also strings.
Assuming your dataframe is df
, I'd try
pd.to_numeric(df.stack(), 'coerce').unstack()
Deeper explanation
Pandas doesn't usually represent missing floats with '-'
. Therefore, that '-'
must be a string. Thus, the dtype
of any column with a '-'
in it, must be 'object'
. That makes it highly likely that whatever parsed the data, left the floats as string.
setup
from io import StringIO
import pandas as pd
txt = """T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 - """
df = pd.read_csv(StringIO(txt), delim_whitespace=True)
print(df)
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 -
1 6.31 10.46 - 5.63 -
2 - 10.66 79.38 3.63 -
3 0.79 4.45 94.24 1.85 -
4 1.45 3.99 91.71 1.17 -
What are the dtypes
?
print(df.dtypes)
T2MN object
T2MX float64
RH2M object
DFP2M float64
RAIN object
dtype: object
What is the type of the first element?
print(type(df.iloc[0, 0]))
<class 'str'>
This means that any column with a '-'
is like a column of strings that look like floats. You want to use pd.to_numeric
with parameter errors='coerce'
to force non-numeric items to np.nan
. However, pd.to_numeric
does not operate on a pd.DataFrame
so we stack
and unstack
.
pd.to_numeric(df.stack(), 'coerce').unstack()
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
Upvotes: 5
Reputation: 2006
I think you want the actual numpy.nan
instead of a string NaN
as you can use a lot of methods such as fillna
/isnull
/notnull
on the pandas.Series
/pandas.DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame([['-']*10]*10)
df = df.replace('-',np.nan)
Upvotes: 2
Reputation: 37930
Just replace()
the string:
In [10]: df.replace('-', 'NaN')
Out[10]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
Upvotes: 5