user305883

Reputation: 1741

pandas str.replace - keep current value if regex fails to avoid NaN in converting string to numbers

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html

I have a column of values formatted as Strings, some includes a comma, as such:

0     20,7
1       22
2       21
3       20
4     24,4
5     23,8
6     23,6
7     21,6
8     24,3
9     23,3
10    24,2

....

17053      16
17054      16
17055      15
17056      15
17057      19
17058      17
17059      18
17060      19
17061      20
17062      21
17063      20

I want to convert them as float.

Now, I cannot directly apply astype or to_numeric method, otherwise strings as "20,7" won't be interpreted.

The strange thing is that if I apply .str.replace(',', '.') to strings that DO NOT contain the pattern, the replacement fails and returns NaN. I would have expected a replacement where the pattern is found, and the value returned as-is otherwise.

I also tried .str.replace(',', '.', regex=False), with no success.

Example:

For values like "20,7" the comma-to-dot replacement works fine, and I can then convert to a number, getting the float 20.7.

But I also have these values:

test['TMEDIA °C'][-10:]

17054    16
17055    15
17056    15
17057    19
17058    17
17059    18
17060    19
17061    20
17062    21
17063    20

but when I apply str.replace I get this:

test['TMEDIA °C'][-10:].str.replace(',', '.')

17054   NaN
17055   NaN
17056   NaN
17057   NaN
17058   NaN
17059   NaN
17060   NaN
17061   NaN
17062   NaN
17063   NaN

so I cannot apply astype to the whole column. Instead, I would like to get this:

test['TMEDIA °C'][-10:].astype(float)
17054    16.0
17055    15.0
17056    15.0
17057    19.0
17058    17.0
17059    18.0
17060    19.0
17061    20.0
17062    21.0
17063    20.0

Is this a bug in str.replace(), or how can I keep the current value when the pattern does not match?

Upvotes: 3

Views: 1553

Answers (3)

a_guest

Reputation: 36269

Most likely you have a column with mixed data types, something like this:

fails = pd.DataFrame([['1,2'], [3]], columns=['a'])
print(fails['a'].str.replace(',', '.'))  # results in NaN

How the column ended up like this is another question. If you had instead:

works = pd.DataFrame([['1,2'], ['3']], columns=['a'])

then it works.

The mixed data types can for example occur if you concatenate two data frames:

df1 = pd.DataFrame([['1,2']], columns=['a'])
df2 = pd.DataFrame([[3]], columns=['a'])
fails = pd.concat([df1, df2])

If, for example, df2 was generated by read_csv and its values contained no ",", then the column is automatically parsed into numbers instead of strings.

You can avoid the NaN issue either by using .str.replace on the data frame that contains the comma values, here df1, before concatenating with the other data frame, or by ensuring that the other data frame's column (here df2['a']) has dtype=str as well (e.g. via read_csv(..., dtype=str)).
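A minimal sketch of the second fix, using astype to force df2's column to str before concatenating (the frames here are toy examples, not the OP's data):

```python
import pandas as pd

# Mixed dtypes after concat: df2's column was parsed as int
df1 = pd.DataFrame([['1,2']], columns=['a'])
df2 = pd.DataFrame([[3]], columns=['a'])

# Force the numeric column to str before concatenating, so the
# combined column is all-string and str.replace works on every row
combined = pd.concat([df1, df2.astype({'a': str})], ignore_index=True)
result = combined['a'].str.replace(',', '.').astype(float)
print(result.tolist())  # [1.2, 3.0]
```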

Upvotes: 3

Viach

Reputation: 508

s = pd.Series(['foo', 'fuz', 'some other'])
# select only the rows that contain the pattern...
idx = s[s.str.contains(r'f', regex=True)].index
# ...and apply the replacement just to those rows, leaving the rest untouched
s.loc[idx] = s.loc[idx].str.replace('f', 'F')
display(s)

0           Foo
1           Fuz
2    some other
dtype: object

Upvotes: 0

Andy L.

Reputation: 25249

The values that turn into NaN after replace have dtype int, so .str.replace returns NaN for them. Try force-converting the whole column to str before calling str.replace:

test['TMEDIA °C'].astype(str).str.replace(',', '.')
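Sketched on a small mixed series like the one in the question (the sample values are illustrative, not the OP's actual column):

```python
import pandas as pd

# Column mixing comma-decimal strings with plain ints
s = pd.Series(['20,7', 22, 21, '24,4'])

# astype(str) makes every element a string, so str.replace no longer
# yields NaN for the entries that were already numeric
converted = s.astype(str).str.replace(',', '.').astype(float)
print(converted.tolist())  # [20.7, 22.0, 21.0, 24.4]
```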

Upvotes: 2
