Data cleaning with pandas using regular expressions

Question

I have several regex replacements like this:

Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'([^\w])', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'x{2,}', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'_+', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'\d+', ' ', regex=True)
Data['SUMMARY'] = Data['SUMMARY'].str.replace(r'\s{2,}', ' ', regex=True)

I want to replace all punctuation, runs of X such as XXXXXXXX, all digits, and all non-alphanumeric characters with the empty string ''. How can I combine these into a single replacement regex?

cs95 · Accepted Answer

Based on your question, you want to remove:

punctuation
X{2,}
digits
anything that is not a letter or digit

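These overlap: punctuation and digits are both covered by "anything that is not a letter or whitespace", so only the runs of X need their own alternative. You're looking to retain only letters and single spaces. You can condense your separate patterns into a single one: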

import pandas as pd

df = pd.DataFrame({'SUMMARY': ['hello, world!', 'XXXXX test', '123four, five:; six...']})

df

                  SUMMARY
0           hello, world!
1              XXXXX test
2  123four, five:; six...

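# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal by default
df.SUMMARY.str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)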

0      hello world
1             test
2    four five six
Name: SUMMARY, dtype: object
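
To see what each alternative in the pattern does, here is a small sketch that applies the same regex with Python's re module to the sample strings, without pandas. The names PATTERN, samples and cleaned are just for illustration.

```python
import re

# Same pattern as above:
#   [^a-zA-Z\s]+  one or more characters that are neither letters nor whitespace
#   X{2,}         two or more consecutive capital X
PATTERN = re.compile(r'[^a-zA-Z\s]+|X{2,}')

samples = ['hello, world!', 'XXXXX test', '123four, five:; six...']
cleaned = [PATTERN.sub('', s) for s in samples]
print(cleaned)
```

Note that 'XXXXX test' becomes ' test' with a leading space, because whitespace is deliberately kept by the character class. The right-aligned pandas output hides this.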

If a value can end up with two or more consecutive spaces (for example, where a removed token sat between two spaces), you'll have to make a separate call to collapse them.

df.SUMMARY = (df.SUMMARY.str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)
                        .str.replace(r'\s{2,}', ' ', regex=True))
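
If you need this in more than one place, the two calls could be wrapped in a small helper. This is only a sketch: the function name clean_summary and the final .str.strip() (which trims leftover leading and trailing whitespace) are my additions, the sample data is slightly modified to include repeated spaces, and regex=True is spelled out on the assumption that you are on pandas 2.0 or later.

```python
import pandas as pd

def clean_summary(series: pd.Series) -> pd.Series:
    """Hypothetical helper: keep letters only, collapse whitespace runs, trim the ends."""
    return (
        series.str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)  # drop non-letters and runs of X
              .str.replace(r'\s{2,}', ' ', regex=True)             # collapse repeated whitespace
              .str.strip()                                         # assumption: edge whitespace is unwanted
    )

df = pd.DataFrame({'SUMMARY': ['hello,   world!', 'XXXXX test', '123four, five:; six...']})
df['SUMMARY'] = clean_summary(df['SUMMARY'])
print(df['SUMMARY'].tolist())
```

Drop the .str.strip() step if leading or trailing whitespace is meaningful in your data.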
