GSA

Reputation: 813

Removing Characters With Regular Expression in List Comprehension in Python

I am learning Python and trying to do some text preprocessing, borrowing ideas from Stack Overflow. I came up with the two statements below, but they don't do what I expected, and they don't raise any errors either, so I'm stumped.

First, in a pandas DataFrame column, I am trying to remove the third consecutive repeated character in a word. It's like running a spell check on words that should have two consecutive identical characters instead of three:

buttter = butter
bettter = better
ladder = ladder

The code I used is below:

import re
docs['Comments'] = [c for c in docs['Comments'] if re.sub(r'(\w)\1{2,}', r'\1', c)]

Second, I just want to replace repeated punctuation marks with the last one:

????? = ?
..... = .
!!!!! = !
----  = -
***** = *

And the code I have for that is:

docs['Comments'] = [i for i in docs['Comments'] if re.sub(r'[\?\.\!\*]+(?=[\?\.\!\*])', '', i)]

Upvotes: 1

Views: 174

Answers (1)

Wiktor Stribiżew

Reputation: 627507

It looks like you want to use

docs['Comments'] = (
    docs['Comments'].str.replace(r'(\w)\1{2,}', r'\1\1', regex=True)
    .str.replace(r'([^\w\s]|_)(\1)+', r'\2', regex=True)
)

The r'(\w)\1{2,}' regex matches three or more consecutive occurrences of the same word character, and the \1\1 replacement leaves exactly two of them.
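
To check that pattern outside pandas, here is a minimal sketch using plain re.sub. The helper name squeeze_letters and the sample list are illustrative and not part of the original code; the words come from the question.

```python
import re

def squeeze_letters(text):
    # Collapse runs of three or more identical word characters down to two.
    return re.sub(r'(\w)\1{2,}', r'\1\1', text)

samples = ['buttter', 'bettter', 'ladder']
result = [squeeze_letters(s) for s in samples]
print(result)  # ['butter', 'better', 'ladder']
```

Note that \w also matches digits and underscores, so a value such as 1000 would become 100 with this pattern.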

The r'([^\w\s]|_)(\1)+' regex matches a run of the same punctuation character and captures the last occurrence in Group 2, so \2 replaces the whole run with that single character.
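
It may also help to see why the list comprehensions in the question changed nothing: there, re.sub sits in the if clause, so its result only decides whether an item is kept and the original string is what gets emitted. The sketch below contrasts that with putting the substitution in the output position. The helper name squeeze_punct and the sample comments are made up for illustration.

```python
import re

def squeeze_punct(text):
    # Collapse a run of one repeated punctuation character (or underscore) to a single one.
    return re.sub(r'([^\w\s]|_)(\1)+', r'\2', text)

comments = ['What?????', 'Wait.....', 'Wow!!!!!', 'a----b', '*****']

# re.sub used as a filter: every non-empty result is truthy, so nothing changes.
unchanged = [c for c in comments if squeeze_punct(c)]

# re.sub used as the output expression: each item is actually rewritten.
cleaned = [squeeze_punct(c) for c in comments]
print(cleaned)  # ['What?', 'Wait.', 'Wow!', 'a-b', '*']
```

Because the pattern relies on the \1 backreference, only runs of the same character are collapsed; a mixed sequence such as ?! is left as it is.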

Upvotes: 1
