Remove duplicated punctuation in a string

I'm working on cleaning some text like the one below:

Great talking with you. ? See you, the other guys and Mr. Jack Daniels next  week, I hope-- ? Bobette ? ? Bobette  Riner???????????????????????????????   Senior Power Markets Analyst??????   TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell:  832/428-7008 [email protected] http://www.tradersnewspower.com ? ?  - cinhrly020101.doc

It has multiple spaces and repeated question marks. To clean it, I'm using regular expressions:

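import re

def remove_duplicate_characters(text):
    # Collapse any run of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Collapse runs of "?" and drop the whitespace in front of them
    text = re.sub(r"\s*\?+", "?", text)
    # Second pass merges marks that were separated by spaces, e.g. "? ?"
    text = re.sub(r"\s*\?+", "?", text)
    return text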


remove_duplicate_characters(msg)

Which gives me the following result:

'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 [email protected] http://www.tradersnewspower.com? - cinhrly020101.doc'

For this particular case it works, but it does not look like the best approach if I want to add more characters to remove. Is there an optimal way to solve this?

Upvotes: 2

Views: 62

Answers (1)

Wiktor Stribiżew

Reputation: 626689

To replace each run of identical consecutive punctuation chars with a single occurrence, you can use

re.sub(r"([^\w\s]|_)\1+", r"\1", text)

If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
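
As a minimal sketch of both variants above, the helper below wraps the two patterns in one function. The name `collapse_punctuation`, the `strip_leading_space` flag, and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

def collapse_punctuation(text, strip_leading_space=False):
    """Collapse runs of one repeated punctuation char to a single char.

    Hypothetical helper for illustration only.
    """
    # ([^\w\s]|_) captures one punctuation char (anything that is not a
    # word char or whitespace, plus "_"); \1+ requires at least one repeat.
    pattern = r"([^\w\s]|_)\1+"
    if strip_leading_space:
        # Also consume whitespace directly in front of the repeated run.
        pattern = r"\s*" + pattern
    return re.sub(pattern, r"\1", text)

print(collapse_punctuation("Riner???   Senior Analyst!!  I hope-- yes"))
# Riner?   Senior Analyst!  I hope- yes
print(collapse_punctuation("Analyst ??? TradersNews", strip_leading_space=True))
# Analyst? TradersNews
```

Note that both patterns only act on runs of two or more identical characters: a lone `?` keeps its leading whitespace, and marks separated by spaces (such as `? ?`) are left untouched.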

In case you want to introduce exceptions to this generic regex, you may add an alternative on the left that captures all the contexts where you want the consecutive punctuation to be kept:

re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)

The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed by other dots on either end) or a :// string (commonly seen in URLs); the rest is the original regex with the backreference adjusted (since there are now two capturing groups).

The \1\2 in the replacement pattern puts the captured values back into the resulting string.
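
The exception-aware regex can be tried out with a short sketch like the one below. The function name `collapse_with_exceptions` and the sample sentence are made up for illustration, and the code assumes Python 3.5 or later, where a backreference to a group that did not participate in the match expands to an empty string.

```python
import re

# Group 1: contexts to keep as-is (a standalone "..." or "://").
# Group 2: a punctuation char that is then repeated (\2+) and collapsed.
KEEP_OR_COLLAPSE = re.compile(r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+")

def collapse_with_exceptions(text):
    # Hypothetical helper: only one of the two groups participates in any
    # given match, so \1\2 yields either the kept context or the single char.
    return KEEP_OR_COLLAPSE.sub(r"\1\2", text)

print(collapse_with_exceptions(
    "Wait... what??? See http://www.tradersnewspower.com!!"
))
# Wait... what? See http://www.tradersnewspower.com!
```

Further exceptions would go into the first group as extra alternatives, which keeps the generic collapsing part of the pattern unchanged.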

Upvotes: 3
