PagMax
PagMax

Reputation: 8578

Python ignoring leading whitespace when doing .replace on dataframe column with strings

I have a dataframe column which generally contains numeric but some entries are written as 'No Data'.

df=pd.DataFrame({'group':[1,2,3,' No Data',4,5,6]})

However, some entries, like in the above example, may start with a whitespace as ' No Data'. I am trying to find an efficient way to replace all No Data entries with NaN without worrying about that extra space. For now, what I have is

df['group'].replace('No Data', np.nan)
print df

Which does not work unless I put a space in search term like

df['group'].replace(' No Data', np.nan)

While this works, I was hoping there is a better way I can do it without checking for extra space.

Note : I can also achieve this by list comprehension and using in

[np.nan if 'No Data' in str(x) else x for x in df['group']]

which will work just fine but I think replace is more readable.

Upvotes: 1

Views: 281

Answers (2)

gtlambert
gtlambert

Reputation: 11971

You can achieve your desired result by integrating a regex into the replace function that you suggest:

import pandas as pd
import numpy as np

df=pd.DataFrame({'group':[1,2,3,' No Data',4,5,6]})

df = df.replace(r'No Data', np.nan, regex=True)
print df

Output

   group
0      1
1      2
2      3
3    NaN
4      4
5      5
6      6

Using the regex r'No data' will look for the string 'No data' anywhere in your cell. If it finds a match, the cell will be replaced with np.nan.

An alternative would be to use applymap to apply a simple lambda function to each cell in the dataframe. I think this is a pretty readable solution:

import pandas as pd
import numpy as np

df=pd.DataFrame({'group':[1,2,3,' No Data',4,5,6]})

df = df.applymap(lambda x: np.nan if type(x) == str and 'No Data' in x else x)
print df

Output

   group
0      1
1      2
2      3
3    NaN
4      4
5      5
6      6

Upvotes: 1

bunji
bunji

Reputation: 5223

Pandas allows you to use regular expressions with the replace function so you can do:

df['group'].replace("\s*No Data", np.nan, regex=True)

Upvotes: 1

Related Questions