Reputation: 1741
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html
I have a column of values formatted as strings, some of which include a comma, like this:
0 20,7
1 22
2 21
3 20
4 24,4
5 23,8
6 23,6
7 21,6
8 24,3
9 23,3
10 24,2
....
17053 16
17054 16
17055 15
17056 15
17057 19
17058 17
17059 18
17060 19
17061 20
17062 21
17063 20
I want to convert them to float.
Now, I cannot directly apply the astype or to_numeric methods, because strings like "20,7" won't be parsed.
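For example, a minimal sketch of the failure on a two-value sample:
import pandas as pd
s = pd.Series(['20,7', '22'])
pd.to_numeric(s)  # raises ValueError: Unable to parse string "20,7"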
The strange thing is that if I apply .str.replace(',', '.') to values that do NOT contain the pattern, the replacement fails and returns NaN. I would have expected a replacement where the pattern is found, and the original value returned as-is otherwise.
I also tried .str.replace(',', '.', regex=False) with no success.
Example:
For values like "20,7" the comma-to-dot replacement works fine, and I can then convert to a number, getting the float 20.7.
But I also have these values:
test['TMEDIA °C'][-10:]
17054 16
17055 15
17056 15
17057 19
17058 17
17059 18
17060 19
17061 20
17062 21
17063 20
and applying str.replace I get this:
test['TMEDIA °C'][-10:].str.replace(',', '.')
17054 NaN
17055 NaN
17056 NaN
17057 NaN
17058 NaN
17059 NaN
17060 NaN
17061 NaN
17062 NaN
17063 NaN
so I cannot apply astype to the whole column. Instead, I would like to get this:
test['TMEDIA °C'][-10:].astype(float)
17054 16.0
17055 15.0
17056 15.0
17057 19.0
17058 17.0
17059 18.0
17060 19.0
17061 20.0
17062 21.0
17063 20.0
Is this a bug in str.replace(), or how can I keep the current value when the pattern does not match?
Upvotes: 3
Views: 1553
Reputation: 36269
Most likely you have a column with mixed data types, something like this:
import pandas as pd
fails = pd.DataFrame([['1,2'], [3]], columns=['a'])  # one str value, one int
print(fails['a'].str.replace(',', '.'))  # the int row comes back as NaN
How the column ended up like this is another question. If you had instead:
works = pd.DataFrame([['1,2'], ['3']], columns=['a'])
then it works.
Mixed data types can occur, for example, if you concatenate two data frames:
df1 = pd.DataFrame([['1,2']], columns=['a'])
df2 = pd.DataFrame([[3]], columns=['a'])
fails = pd.concat([df1, df2])
If df2 was generated by read_csv and it had no "," values, then its column is automatically parsed into numbers instead of strings.
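A quick sketch of that inference, feeding read_csv an in-memory CSV (the io.StringIO source stands in for a real file):
import io
import pandas as pd
print(pd.read_csv(io.StringIO('a\n3\n4\n'))['a'].dtype)             # int64 -- parsed as numbers
print(pd.read_csv(io.StringIO('a\n3\n4\n'), dtype=str)['a'].dtype)  # object -- kept as strings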
You can avoid the NaN issue either by using .str.replace on the data frame that contains the comma values (here df1) before concatenating with the other data frame, or by ensuring that the other data frame's column (here df2['a']) has dtype=str as well (e.g. via read_csv(..., dtype=str)).
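A sketch of both workarounds, reusing the df1 and df2 frames from above:
# Workaround 1: replace the commas while df1 is still all-string, then concatenate.
clean = pd.concat([df1.assign(a=df1['a'].str.replace(',', '.')), df2])
print(clean['a'].astype(float))  # 1.2 and 3.0
# Workaround 2: force the other column to str, then replace on the concatenated result.
df2['a'] = df2['a'].astype(str)
print(pd.concat([df1, df2])['a'].str.replace(',', '.').astype(float))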
Upvotes: 3
Reputation: 508
import pandas as pd
s = pd.Series(['foo', 'fuz', 'some other'])
# Select the rows whose values contain the pattern...
idx = s[s.str.contains(r'f', regex=True)].index
# ...and apply the replacement only to those rows.
s.loc[idx] = s.loc[idx].str.replace('f', 'F')
display(s)
0 Foo
1 Fuz
2 some other
dtype: object
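Applied to the question's column, the same idea would look like this (a sketch with a made-up mixed str/int series; na=False makes str.contains treat the non-string rows as non-matching):
import pandas as pd
s = pd.Series(['20,7', 22, '24,4', 21])        # mixed str/int, as in the question
idx = s[s.str.contains(',', na=False)].index   # only the rows that really hold a comma
s.loc[idx] = s.loc[idx].str.replace(',', '.')
print(s.astype(float))                         # 20.7, 22.0, 24.4, 21.0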
Upvotes: 0
Reputation: 25249
The values that come back as NaN after replace are the ones stored as int, not str: the .str accessor returns NaN for non-string elements. Try force-converting the whole column to str before calling str.replace:
test['TMEDIA °C'].astype(str).str.replace(',', '.')
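Putting it together for the full conversion the question asks for, a sketch on a small mixed sample:
import pandas as pd
test = pd.DataFrame({'TMEDIA °C': ['20,7', 22, 21, '24,4']})  # mixed str/int sample
test['TMEDIA °C'] = (
    test['TMEDIA °C']
    .astype(str)              # every value becomes str, so .str.replace never returns NaN
    .str.replace(',', '.')    # normalize the decimal separator
    .astype(float)            # now the whole column converts cleanly
)
print(test['TMEDIA °C'])      # 20.7, 22.0, 21.0, 24.4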
Upvotes: 2