Sansa

Reputation: 13

Data cleaning with pandas using regular expressions

I have several regex replacements like this:

Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'([^\w])', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'x{2,}', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'_+', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'\d+', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'\s{2,}', ' ', regex=True)

I want to replace all punctuation, runs of X (e.g. XXXXXXXX), all digits, and all non-alphanumeric characters with the empty string ''. How can I combine these into a single regex replacement?

Upvotes: 0

Views: 3752

Answers (2)

Allan

Reputation: 12456

If you want to remove runs of two or more x characters in either case, and also remove spaces (and other whitespace characters), use:

(?i)([^a-z]+|X{2,})

If you want to keep whitespace characters but still remove runs of two or more x characters in either case, use:

(?i)([^a-z\s]+|X{2,})

If you want to remove only upper-case runs of two or more X and keep lower-case runs of x, use:

([^a-zA-Z\s]+|X{2,})
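
To make the difference between the three patterns concrete, here is a minimal sketch that applies each of them with Python's `re.sub`. The sample string and the dictionary keys are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample text, chosen to contain lower- and upper-case runs of x,
# digits, punctuation and spaces.
sample = "Call xx123 NOW, XXXX ok"

# The three patterns from this answer, under illustrative names.
patterns = {
    "drop_whitespace": r"(?i)([^a-z]+|X{2,})",
    "keep_whitespace": r"(?i)([^a-z\s]+|X{2,})",
    "upper_x_only": r"([^a-zA-Z\s]+|X{2,})",
}

# Replace every match with the empty string.
results = {name: re.sub(pat, "", sample) for name, pat in patterns.items()}

for name, cleaned in results.items():
    print(f"{name}: {cleaned!r}")
# drop_whitespace: 'CallNOWok'
# keep_whitespace: 'Call  NOW  ok'
# upper_x_only: 'Call xx NOW  ok'
```

Note that the two whitespace-preserving patterns leave double spaces where something was removed between two spaces, so a follow-up replacement of `\s{2,}` may still be needed.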

Upvotes: 0

cs95

Reputation: 402962

So you want to remove (based on your question)

  1. punctuation
  2. X{2,}
  3. digits
  4. anything that is not a letter or digit

These requirements overlap: in effect, you want to keep only letters and single whitespace characters. You can condense your separate patterns into a single one:

import pandas as pd

df = pd.DataFrame({'SUMMARY' : ['hello, world!', 'XXXXX test', '123four, five:; six...']})

df

                  SUMMARY
0           hello, world!
1              XXXXX test
2  123four, five:; six...

df.SUMMARY.str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)

0      hello world
1             test
2    four five six
Name: SUMMARY, dtype: object

If the result can contain runs of two or more whitespace characters, you'll need a separate call to collapse them:

df.SUMMARY = df.SUMMARY.str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)\
                       .str.replace(r'\s{2,}', ' ', regex=True)
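
As a rough sketch, the two-step cleanup can be wrapped in a small helper. The function name `clean_summary`, the extra `str.strip()` call, and the fourth sample row are my own additions rather than part of the question. It also assumes a pandas version where `str.replace` accepts the `regex` keyword; passing `regex=True` explicitly matters because newer pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd


def clean_summary(series: pd.Series) -> pd.Series:
    """Hypothetical helper: keep only letters separated by single spaces."""
    return (
        series
        # Drop anything that is not a letter or whitespace, plus runs of X.
        .str.replace(r"[^a-zA-Z\s]+|X{2,}", "", regex=True)
        # Collapse runs of whitespace left behind by the first step.
        .str.replace(r"\s{2,}", " ", regex=True)
        # Trim leading/trailing whitespace (e.g. ' test' after removing 'XXXXX').
        .str.strip()
    )


df = pd.DataFrame(
    {"SUMMARY": ["hello, world!", "XXXXX test", "123four, five:; six...", "a  1  b"]}
)
cleaned = clean_summary(df["SUMMARY"])
print(cleaned.tolist())
# ['hello world', 'test', 'four five six', 'a b']
```

The order of the two replacements matters: removing characters first can create new double spaces (as in the last row), so the whitespace collapse has to run afterwards.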

Upvotes: 3
