Kalron
Kalron

Reputation: 884

Replace unwanted strings in pandas dataframe element wise and efficiently

I have a very large dataframe (thousands x thousands) only showing 5 x 3 here, time is the index

                                  col1                col2             col3
time                                                                         
05/04/2018 05:14:52 AM             +unend           +unend                  0
05/04/2018 05:14:57 AM                 0                 0                  0
05/04/2018 05:15:02 AM            30.691             0.000              0.121
05/04/2018 05:15:07 AM            30.691             n. def.            0.108
05/04/2018 05:15:12 AM            30.715             0.000              0.105

As these are coming from some other device (df is produced by pd.read_csv(filename)) the dataframe instead of being a completely float type now ends up having unwanted strings like +unend and n. def.. These are not the classical +infinity or NaN , that df.fillna() could take care off. I would like to replace the strings with 0.0. I saw these answers Pandas replace type issue and replace string in pandas dataframe which although try to do the same thing, are column or row wise, but not elementwise. However, in the comments there were some good hints of proceeding for general case as well.

If i try to do

mask = df.apply(lambda x: x.str.contains(r'+unend|n. def.'))
df[mask] =0.0

i get error: nothing to repeat

if i do

mask = df.apply(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask]=0.0

i would get a Series object with True or False for every column rather than a elementwise mask and therefore an error TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value.

The below

mask = df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) )
df[mask.values]=0.0

does give me the intended result replacing all the unwanted strings with 0.0 However, it is slow (unpythonic?) and also, i am not sure if i can use regex for the check rather than in, especially, if i know there are mixed datatypes. Is there an efficient, fast, robust but also elementwise general way to do this?

Upvotes: 1

Views: 1119

Answers (3)

jezrael
jezrael

Reputation: 863166

As pointed Edchum if need replace all non numeric values to 0 - first to_numeric with errors='coerce' create NaNs for not parseable values and then convert them to 0 by fillna:

df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(0)

If values are not substrings use DataFrame.isin or very nice answer of Haleemur Ali:

df = df.mask(df.isin(['+unend','n. def.']), 0).astype(float)

For substrings with define values:

There are special regex char + and ., so need escape them by \:

df = df.mask(df.astype(str).apply(lambda x: x.str.contains(r'(\+unend|n\. def\.)')), 0).astype(float)

Or use applymap for elemnetwise check:

df = df.mask(df.applymap(lambda x: (str('n. def.') in (str(x)) or (str('unend') in str(x))) ), 0).astype(float)

print (df)
                          col1  col2   col3
time                                       
05/04/2018 05:14:52 AM   0.000   0.0  0.000
05/04/2018 05:14:57 AM   0.000   0.0  0.000
05/04/2018 05:15:02 AM  30.691   0.0  0.121
05/04/2018 05:15:07 AM  30.691   0.0  0.108
05/04/2018 05:15:12 AM  30.715   0.0  0.105

Upvotes: 0

Haleemur Ali
Haleemur Ali

Reputation: 28303

These are not the classical +infinity or NaN , that df.fillna() could take care off

You can specify a list of strings to consider as NA when reading the csv file.

df = pd.read_csv(filename, na_values=['+unend', 'n. def.'])

And then fill the NA values with fillna

Upvotes: 5

jpp
jpp

Reputation: 164773

Do not use pd.Series.str.contains or pd.Series.isin

A more efficient solution to this problem is to use pd.to_numeric to convert try and convert all data to numeric.

Use errors='coerce' to default to NaN, which you can then use with pd.Series.fillna.

cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce').fillna(0)

Upvotes: 0

Related Questions