Reputation: 115
I have a problem with regular expressions. I have a dataset with money amounts, and in some rows there is an extra (thousands) separator. I need a regular expression that removes only that extra separator.
For example, this is the data I have:
user_id sum
1 10.10
2 154.24
3 19.565.02
4 2.142.00
5 1.99
And the expected result is:
user_id sum
1 10.10
2 154.24
3 19565.02
4 2142.00
5 1.99
I use Python and the pandas library for data analysis.
Please help me with the regex. Thank you!
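For reference, a minimal sketch of one possible regex (assuming the values are read as strings and the last dot is always the decimal point) uses a lookahead to drop every dot that is followed by another dot:
import re

values = ['10.10', '154.24', '19.565.02', '2.142.00', '1.99']
# r'\.(?=.*\.)' matches a dot only when another dot follows later in the
# string, so every separator except the final decimal point is removed
cleaned = [re.sub(r'\.(?=.*\.)', '', v) for v in values]
# ['10.10', '154.24', '19565.02', '2142.00', '1.99']
The same pattern works on a pandas column with df['sum'].str.replace(r'\.(?=.*\.)', '', regex=True).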
Upvotes: 1
Views: 78
Reputation: 42916
A slightly different way, with conditional column creation using np.where from the numpy module:
df['sum'] = np.where(df['sum_col'].str.count(r'\.') >= 2,
                     df['sum_col'].str.replace('.', '', n=1, regex=False),
                     df['sum_col'])
Or, for any number of dots:
df['sum'] = pd.to_numeric([i.replace('.', '', x) for i, x in
                           zip(df['sum_col'], df['sum_col'].str.count(r'\.') - 1)])
Returns:
sum_col sum
0 10.10 10.10
1 154.24 154.24
2 19.565.02 19565.02
3 2.142.00 2142.00
The sum column is the cleaned-up column.
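Putting it together, a minimal self-contained sketch of this approach (assuming the raw values are read as strings into a column named sum_col):
import numpy as np
import pandas as pd

df = pd.DataFrame({'sum_col': ['10.10', '154.24', '19.565.02', '2.142.00']})
# for each value, remove every dot except the last one;
# str.replace with a count of 0 leaves single-dot values untouched
dots = df['sum_col'].str.count(r'\.')
df['sum'] = pd.to_numeric([v.replace('.', '', n - 1) for v, n in zip(df['sum_col'], dots)])
pd.to_numeric converts the cleaned strings to floats so the column can be used for arithmetic.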
Upvotes: 1
Reputation: 1143
Well, if your data always has 2 decimal places at the end, you can skip the regex and just use plain Python.
For example, say you get all your data into a list (ignoring the header row); you can do the following to fix the dataset:
dirty = ['10.10', '154.24', '19.565.02', '2.142.00', '1.99']
# this is a list comprehension that replaces any '.' with '' in all
# but the last three characters of your strings
clean = [item[:-3].replace('.', '') + item[-3:] for item in dirty]
>>> clean
['10.10', '154.24', '19565.02', '2142.00', '1.99']
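The same slice-and-replace idea can also be applied directly to a pandas string column (a sketch, assuming the column is named sum):
import pandas as pd

df = pd.DataFrame({'sum': ['10.10', '154.24', '19.565.02', '2.142.00', '1.99']})
# drop every dot from all but the last three characters, then re-attach the '.XX' part
df['sum'] = df['sum'].str[:-3].str.replace('.', '', regex=False) + df['sum'].str[-3:]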
Answer updated thanks to @match.
Upvotes: 3