Kevin Hansen

Reputation: 85

Having trouble replacing empty strings with NaN using pandas.DataFrame.replace()

I have a pandas dataframe which has some observations with empty strings which I want to replace with NaN (np.nan).

I am successfully replacing most of these empty strings using

df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)

But I am still finding empty strings. For example, when I run

sub_df = df[df['OBJECT_COL'] == '']
sub_df.replace(r'\s+', np.nan, regex = True)
print(sub_df['OBJECT_COL'] == '') 

The output is True for every row.

Is there a different method I should be trying? Is there a way to read the encoding of these cells such that perhaps my .replace() is not effective because the encoding is weird?

Upvotes: 3

Views: 628

Answers (3)

Karn Kumar

Reputation: 8816

Some alternatives:

sub_df.replace(r'^\s+$', np.nan, regex=True)

Or, to replace both empty strings and records containing only whitespace:

sub_df.replace(r'^\s*$', np.nan, regex=True)

Alternatively, use apply() with a lambda function:

sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)

Example illustration:

>>> import numpy as np
>>> import pandas as pd

Example DataFrame containing empty strings and whitespace-only strings:

>>> sub_df
        col_A
0
1
2   somevalue
3  othervalue
4

Solutions applied to the different cases:

1) Best solution:

>>> sub_df.replace(r'\s+',np.nan,regex=True).replace('',np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN

2) This works only partially: it replaces whitespace-only strings but leaves empty strings untouched:

>>> sub_df.replace(r'^\s+$', np.nan, regex=True)
        col_A
0
1         NaN
2   somevalue
3  othervalue
4         NaN

3) This also works for both conditions.

>>> sub_df.replace(r'^\s*$', np.nan, regex=True)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN

4) This also works for both conditions.

>>> sub_df.apply(lambda x: x.str.strip()).replace('', np.nan)
        col_A
0         NaN
1         NaN
2   somevalue
3  othervalue
4         NaN
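The example above does not show how sub_df was built, so here is a minimal, self-contained sketch that reproduces it. The data is an assumption chosen to match the printed output (row 0 is an empty string, rows 1 and 4 contain only spaces). It uses the anchored pattern from option 3, which only matches cells that are empty or whitespace from start to end.

```python
import numpy as np
import pandas as pd

# Assumed data mirroring the example: one empty string,
# two whitespace-only strings, and two real values.
sub_df = pd.DataFrame({"col_A": ["", " ", "somevalue", "othervalue", "   "]})

# ^\s*$ matches a cell that is empty or made up entirely of whitespace.
# replace() returns a new DataFrame, so the result is assigned to a name.
cleaned = sub_df.replace(r"^\s*$", np.nan, regex=True)

print(cleaned["col_A"].isna().tolist())
# [True, True, False, False, True]
```

Anchoring with ^ and $ is the cautious choice here: as far as I know, an unanchored pattern such as \s+ can also hit strings that merely contain whitespace, so it is worth checking on your own data before relying on it.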

Upvotes: 3

jpp

Reputation: 164693

pd.DataFrame.replace (like pd.Series.replace) does not work in-place by default. You need to specify inplace=True explicitly:

sub_df.replace(r'\s+', np.nan, regex=True, inplace=True)

Or, alternatively, assign back to sub_df:

sub_df = sub_df.replace(r'\s+', np.nan, regex=True)
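A small sketch of the difference between calling replace() and assigning its result, using made-up data (the values in df are hypothetical). It also swaps in the pattern ^\s*$ instead of \s+, because \s+ needs at least one whitespace character and so cannot match a truly empty string, which is what the question's sub_df contains.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two empty strings, one real value, one whitespace-only.
df = pd.DataFrame({"OBJECT_COL": ["", "abc", "  ", ""]})

# .copy() makes sub_df independent of df, so later changes do not
# depend on whether the selection is a view of the original frame.
sub_df = df[df["OBJECT_COL"] == ""].copy()

# Without assignment the result is discarded and sub_df is unchanged.
sub_df.replace(r"^\s*$", np.nan, regex=True)
still_empty = (sub_df["OBJECT_COL"] == "").all()

# Assigning back keeps the replaced values.
sub_df = sub_df.replace(r"^\s*$", np.nan, regex=True)
now_missing = sub_df["OBJECT_COL"].isna().all()

print(still_empty, now_missing)
```

Assigning back is usually the safer of the two options when sub_df was selected from another DataFrame, since in-place modification of a selection can trigger pandas' chained-assignment warnings.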

Upvotes: 2

Mohit Motwani

Reputation: 4792

Try np.where:

df['OBJECT_COL'] = np.where(df['OBJECT_COL'] == '', np.nan, df['OBJECT_COL'])
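Note that comparing against '' only catches exactly empty strings. A possible variation, sketched here with hypothetical data, strips each value first so that whitespace-only cells are caught too, while text with internal spaces is left alone. The is_blank name is just for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: empty, whitespace-only, and two real values.
df = pd.DataFrame({"OBJECT_COL": ["", "  ", "abc", "a b"]})

# True where the cell is empty once surrounding whitespace is stripped.
is_blank = df["OBJECT_COL"].str.strip().eq("")

# Keep the original value where the cell is not blank, NaN otherwise.
df["OBJECT_COL"] = np.where(is_blank, np.nan, df["OBJECT_COL"])

print(df["OBJECT_COL"].isna().tolist())
# [True, True, False, False]
```

This assumes the column holds strings; the .str accessor raises an error on columns of other types.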

Upvotes: 0
