Reputation: 13
I have several regexps like this,
Data['SUMMARY']=Data['SUMMARY'].str.replace(r'([^\w])',' ',regex=True)
Data['SUMMARY']=Data['SUMMARY'].str.replace(r'x{2,}',' ',regex=True)
Data['SUMMARY']=Data['SUMMARY'].str.replace(r'_+',' ',regex=True)
Data['SUMMARY']=Data['SUMMARY'].str.replace(r'\d+',' ',regex=True)
Data['SUMMARY']=Data['SUMMARY'].str.replace(r'\s{2,}',' ',regex=True)
I want to replace all punctuation, runs such as XXXXXXXX, all digits, and all other non-alphanumeric characters with the empty string ''. How can I combine all of these into a single replacement regex?
Upvotes: 0
Views: 3752
Reputation: 12456
If you want to remove runs of two or more x in either case, and also remove spaces (and other whitespace characters), use:
(?i)([^a-z]+|X{2,})
If you want to keep the whitespace characters and still remove runs of two or more x in either case, use:
(?i)([^a-z\s]+|X{2,})
If you want to remove only upper-case runs of two or more X and keep lower-case runs of x, use:
([^a-zA-Z\s]+|X{2,})
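To make the difference between the three patterns concrete, here is a small sketch using Python's standard `re` module. The sample string and the dictionary keys are made up for illustration; only the patterns come from this answer.

```python
import re

# The three patterns from the answer above, keyed by hypothetical labels.
patterns = {
    'letters_only': r'(?i)([^a-z]+|X{2,})',       # drops whitespace too
    'keep_whitespace': r'(?i)([^a-z\s]+|X{2,})',  # x runs in either case
    'uppercase_x_only': r'([^a-zA-Z\s]+|X{2,})',  # lower-case x runs survive
}

sample = 'XXXX call 123, xxx box!'  # made-up test string

# Replace every match with the empty string and collect the results.
results = {name: re.sub(pattern, '', sample) for name, pattern in patterns.items()}

for name, cleaned in results.items():
    print(name, repr(cleaned))
```

Note that the two whitespace-preserving variants leave runs of spaces behind wherever something was removed between words, so a second pass is still needed if single spaces matter.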
Upvotes: 0
Reputation: 402962
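So you want to remove (based on your question) punctuation and other non-alphanumeric characters, digits, underscores, and runs of two or more X (X{2,}).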
There are overlapping themes here: you're looking to retain only letters and single spaces, so you can condense your separate patterns into a single one:
import pandas as pd
df = pd.DataFrame({'SUMMARY' : ['hello, world!', 'XXXXX test', '123four, five:; six...']})
df
                  SUMMARY
0           hello, world!
1              XXXXX test
2  123four, five:; six...
df.SUMMARY.str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)
0      hello world
1             test
2    four five six
Name: SUMMARY, dtype: object
If your values can contain runs of two or more whitespace characters, you'll have to make a separate call to collapse them into a single space.
df.SUMMARY = df.SUMMARY.str.replace(r'[^a-zA-Z\s]+|X{2,}', '', regex=True)\
.str.replace(r'\s{2,}', ' ', regex=True)
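The reason two calls are needed is that the first pattern is replaced with the empty string while the whitespace runs are replaced with a single space, and a plain string replacement can only supply one of those. As a rough sketch, the same two-step cleanup can be written with the standard `re` module alone. The helper name `clean_summary` and the last sample string are hypothetical, and the final `.strip()` is an extra step that the pandas snippet above does not perform.

```python
import re

# Same two patterns as in the answer above, precompiled for reuse.
NON_LETTERS_OR_X_RUNS = re.compile(r'[^a-zA-Z\s]+|X{2,}')
WHITESPACE_RUNS = re.compile(r'\s{2,}')

def clean_summary(text):
    """Hypothetical helper: keep only letters separated by single spaces."""
    # Step 1: delete non-letter, non-whitespace runs and runs of 2+ 'X'.
    stripped = NON_LETTERS_OR_X_RUNS.sub('', text)
    # Step 2: collapse whitespace runs left behind by step 1.
    collapsed = WHITESPACE_RUNS.sub(' ', stripped)
    # Extra step (assumption, not in the answer): trim leading/trailing whitespace.
    return collapsed.strip()

samples = ['hello, world!', 'XXXXX test', '123four, five:; six...', 'a  1  b']
cleaned = [clean_summary(s) for s in samples]
print(cleaned)
```

If you prefer this form, a function like this could be applied to the column with df.SUMMARY.map(clean_summary), although the chained .str.replace calls are usually the more direct route in pandas.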
Upvotes: 3