Reputation: 884
I have a very large dataframe (thousands x thousands); only 5 x 3 is shown here, with time as the index:
col1 col2 col3
time
05/04/2018 05:14:52 AM +unend +unend 0
05/04/2018 05:14:57 AM 0 0 0
05/04/2018 05:15:02 AM 30.691 0.000 0.121
05/04/2018 05:15:07 AM 30.691 n. def. 0.108
05/04/2018 05:15:12 AM 30.715 0.000 0.105
As these come from some other device (df is produced by pd.read_csv(filename)), the dataframe, instead of being entirely float, ends up containing unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN that df.fillna() could take care of. I would like to replace the strings with 0.0. I saw the answers Pandas replace type issue and replace string in pandas dataframe, which attempt the same thing but work column- or row-wise rather than elementwise. However, the comments there gave some good hints for the general case as well.
If I try to do
mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] = 0.0
I get error: nothing to repeat.
If I do
mask = df.apply(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask] = 0.0
I get a Series object with True or False for every column rather than an elementwise mask, and therefore the error TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.
The following
mask = df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask.values] = 0.0
does give me the intended result, replacing all the unwanted strings with 0.0. However, it is slow (unpythonic?), and I am also not sure whether I can use a regex for the check rather than in, especially since there are mixed datatypes. Is there an efficient, fast, robust, but also elementwise general way to do this?
Upvotes: 1
Views: 1119
Reputation: 863166
As pointed out by Edchum, if you need to replace all non-numeric values with 0 - first use to_numeric with errors='coerce' to create NaNs for the non-parseable values, and then convert them to 0 with fillna:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)
If the values are not substrings, use DataFrame.isin or the very nice answer of Haleemur Ali:
df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)
For substrings with defined values: + and . are special regex characters, so they need to be escaped with \:
df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'(\+unend|n\. def\.)')), 0).astype(float)
Or use applymap for an elementwise check:
df = df.mask(df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) ), 0).astype(float)
print (df)
col1 col2 col3
time
05/04/2018 05:14:52 AM 0.000 0.0 0.000
05/04/2018 05:14:57 AM 0.000 0.0 0.000
05/04/2018 05:15:02 AM 30.691 0.0 0.121
05/04/2018 05:15:07 AM 30.691 0.0 0.108
05/04/2018 05:15:12 AM 30.715 0.0 0.105
Upvotes: 0
Reputation: 28303
These are not the classical +infinity or NaN, that df.fillna() could take care of
You can specify a list of strings to consider as NA when reading the CSV file.
df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])
And then fill the NA values with fillna.
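A minimal sketch of the combined read-then-fill step, reusing the filename and the two placeholder strings from the question:
import pandas as pd

# treat the device's placeholder strings as NA while parsing the CSV,
# then replace the resulting NaN values with 0.0
df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])
df = df.fillna(0.0)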
Upvotes: 5
Reputation: 164773
Do not use pd.Series.str.contains or pd.Series.isin
A more efficient solution to this problem is to use pd.to_numeric to try to convert all data to numeric. Use errors='coerce' to default to NaN, which you can then use with pd.Series.fillna.
cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').fillna(0)
Upvotes: 0