son520804
son520804

Reputation: 483

Pandas removing whitespace in part of string

I am writing a regex code to remove the whitespace for financial values of the string in pandas dataframe. Let's suppose a pandas dataframe called df and a column "detail" is in the dataframe df.

For example, the column contains string like:

"maximum USD 1 000 000 per house as respects Detroit"

"minimum $ 3 000 000 per destroyed house due to Earthquake"

I tried using str.replace method in pandas, with regex applied. However,

df['detail'].str.replace(r"(USD)\s*(\d+\s*)+(\d+)", r"USD\s\d+\d+", regex=True)

df['detail'].str.replace(r"(\$)\s*(\d+\s*)+(\d+)", r"\$\s\d+\d+", regex=True)

will return

"maximum USD\s\d+\d+ per house as respects Detroit"

"minimum \$\s\d+\d+ per destroyed house due to Earthquake" rather than replacing the number with the right format.

The expected result is to remove the whitespace for financial values and return:

"maximum USD 1000000 per house as respects Detroit"

"minimum $ 3000000 per destroyed house due to Earthquake"

What code should I try to remove whitespace inside the financial values? The actual data is huge so manual work is going to be immense.

Solved:

Thanks for the answers, I used the following code and it works well. In big data it is possible for the USD/$ to have multiple whitespaces before the financial number (Or it could be 0), so \s+ or \s* provide a safe approach for this problem.

df['detail'] = df['detail'].apply(lambda x: re.sub(r'\s+(\d{3})', r'\1', x))

Upvotes: 2

Views: 2587

Answers (2)

secretive
secretive

Reputation: 2120

the one line answer for your problem will be using apply function in pandas.

import re
df['detail'] = df['detail'].apply(lambda x: re.sub(r'(\d)\s+(\d)',r'\1\2', x))

Upvotes: 2

accdias
accdias

Reputation: 5372

This seems to work for your case:

>>> import re
>>> phrases = ["minimum $ 3 000 000 per destroyed house due to Earthquake",
"maximum USD 1 000 000 per house as respects Detroit",
"You own me $ 1 000 000 000 dollars" ]
>>> for phrase in phrases:
...     re.sub(r'\s(\d{3})', r'\1', phrase)
... 
'minimum $ 3000000 per destroyed house due to Earthquake'
'maximum USD 1000000 per house as respects Detroit'
'You own me $ 1000000000 dollars'
>>> 

Tested also with this:

>>> re.sub(r'\s(\d{3})', r'\1', 'This is a really big number 3 000 434 132 131 111 333 right?')
'This is a really big number 3000434132131111333 right?'
>>> 

Upvotes: 3

Related Questions