Reputation: 483
I am writing a regex code to remove the whitespace for financial values of the string in pandas dataframe. Let's suppose a pandas dataframe called df and a column "detail" is in the dataframe df.
For example, the column contains string like:
"maximum USD 1 000 000 per house as respects Detroit"
"minimum $ 3 000 000 per destroyed house due to Earthquake"
I tried using str.replace method in pandas, with regex applied. However,
df['detail'].str.replace(r"(USD)\s*(\d+\s*)+(\d+)", r"USD\s\d+\d+", regex=True)
df['detail'].str.replace(r"(\$)\s*(\d+\s*)+(\d+)", r"\$\s\d+\d+", regex=True)
will return
"maximum USD\s\d+\d+ per house as respects Detroit"
"minimum \$\s\d+\d+ per destroyed house due to Earthquake" rather than replacing the number with the right format.
The expected result is to remove the whitespace for financial values and return:
"maximum USD 1000000 per house as respects Detroit"
"minimum $ 3000000 per destroyed house due to Earthquake"
What code should I try to remove whitespace inside the financial values? The actual data is huge so manual work is going to be immense.
Solved:
Thanks for the answers, I used the following code and it works well. In big data it is possible for the USD/$ to have multiple whitespaces before the financial number (Or it could be 0), so \s+ or \s* provide a safe approach for this problem.
df['detail'] = df['detail'].apply(lambda x: re.sub(r'\s+(\d{3})', r'\1', x))
Upvotes: 2
Views: 2587
Reputation: 2120
the one line answer for your problem will be using apply function in pandas.
import re
df['detail'] = df['detail'].apply(lambda x: re.sub(r'(\d)\s+(\d)',r'\1\2', x))
Upvotes: 2
Reputation: 5372
This seems to work for your case:
>>> import re
>>> phrases = ["minimum $ 3 000 000 per destroyed house due to Earthquake",
"maximum USD 1 000 000 per house as respects Detroit",
"You own me $ 1 000 000 000 dollars" ]
>>> for phrase in phrases:
... re.sub(r'\s(\d{3})', r'\1', phrase)
...
'minimum $ 3000000 per destroyed house due to Earthquake'
'maximum USD 1000000 per house as respects Detroit'
'You own me $ 1000000000 dollars'
>>>
Tested also with this:
>>> re.sub(r'\s(\d{3})', r'\1', 'This is a really big number 3 000 434 132 131 111 333 right?')
'This is a really big number 3000434132131111333 right?'
>>>
Upvotes: 3