Removing Characters With Regular Expression in List Comprehension in Python

Question

I am learning Python and trying to do some text preprocessing, borrowing ideas from Stack Overflow. I came up with the two expressions below, but they don't do what I expected, and they don't raise any errors either, so I'm stumped.

First, in a pandas DataFrame column, I am trying to remove the third consecutive occurrence of a character in a word. It's like running a spell check on words that should have two consecutive characters instead of three:

buttter = butter
bettter = better
ladder = ladder

The code I used is below:

import re
docs['Comments'] = [c for c in docs['Comments'] if re.sub(r'(\w)\1{2,}', r'\1', c)]

In the second case, I just want to replace a run of repeated punctuation with the last character of the run:

????? = ?
..... = .
!!!!! = !
----  = -
***** = *

And the code I have for that is:

docs['Comments'] = [i for i in docs['Comments'] if re.sub(r'[\?\.\!\*]+(?=[\?\.\!\*])', '', i)]

Wiktor Stribiżew · Accepted Answer

It looks like you want to use

docs['Comments'] = (docs['Comments'].str.replace(r'(\w)\1{2,}', r'\1\1', regex=True)
    .str.replace(r'([^\w\s]|_)(\1)+', r'\2', regex=True))
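
As for why the code in the question appears to do nothing: in a list comprehension, whatever follows `if` is only a filter. The result of `re.sub` is tested for truthiness and then discarded, so every non-empty string is kept unchanged. A minimal sketch of the difference, using plain `re` and a made-up list named `comments` in place of the DataFrame column:

```python
import re

comments = ["buttter", "ladder"]  # hypothetical stand-in for docs['Comments']

# Pattern from the question: re.sub is evaluated only as a filter condition.
# Its (non-empty, hence truthy) result is thrown away and c is kept as-is.
filtered = [c for c in comments if re.sub(r'(\w)\1{2,}', r'\1', c)]

# Moving re.sub to the output position actually applies the substitution.
mapped = [re.sub(r'(\w)\1{2,}', r'\1\1', c) for c in comments]

print(filtered)  # ['buttter', 'ladder']
print(mapped)    # ['butter', 'ladder']
```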

The r'(\w)\1{2,}' regex matches a word character followed by two or more repetitions of itself (three or more identical word characters in a row), and the \1\1 replacement keeps exactly two of them. See this regex demo.
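
To see how this pattern behaves on a few inputs, here is a small sketch with plain `re.sub`. The helper name `squeeze_word_chars` is made up for illustration. Note that `\w` also matches digits and the underscore, so numbers with three or more repeated digits are shortened too, which may or may not be what you want.

```python
import re

def squeeze_word_chars(text):
    # Hypothetical helper: collapse runs of 3+ identical word chars to two.
    return re.sub(r'(\w)\1{2,}', r'\1\1', text)

print(squeeze_word_chars("bettter"))     # better
print(squeeze_word_chars("soooooo"))     # soo
print(squeeze_word_chars("ladder"))      # ladder (only two d's, no match)
print(squeeze_word_chars("1000 items"))  # 100 items (\w matches digits too)
```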

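The r'([^\w\s]|_)(\1)+' regex matches a run of two or more identical punctuation characters (any character that is neither a word character nor whitespace, or an underscore) and captures the last one into Group 2, so \2 replaces the whole run with that single character. See this regex demo.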
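
The punctuation pattern can be checked the same way. This is a sketch with plain `re.sub` and a made-up helper name, `squeeze_punctuation`. Only runs of the same character are collapsed, so a mixed sequence such as `?!` is left alone.

```python
import re

def squeeze_punctuation(text):
    # Hypothetical helper: collapse runs of 2+ identical punctuation chars
    # (anything that is not a word char or whitespace, plus "_") to one.
    return re.sub(r'([^\w\s]|_)(\1)+', r'\2', text)

print(squeeze_punctuation("What?????"))    # What?
print(squeeze_punctuation("wait....."))    # wait.
print(squeeze_punctuation("a ---- b"))     # a - b
print(squeeze_punctuation("snake__case"))  # snake_case
print(squeeze_punctuation("huh?!"))        # huh?! (different chars, no match)
```

The same pattern and replacement strings can be passed to `Series.str.replace(..., regex=True)`, as in the snippet at the top of this answer.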
