Replace all consecutive repeated letters ignoring specific words

Question

I've seen plenty of suggestions for removing consecutive repeated letters in a sentence, using either re (regex) or ''.join in Python, but I want to make an exception for specific words.

E.g.:

I want this sentence:

sentence = 'hello, join this meeting heere using thiis lllink'

to become:

'hello, join this meeting here using this link'

given this list of words that should be kept as-is and skipped by the repeated-letter check: keepWord = ['hello','meeting']

The two scripts I found useful are:

Using .join:

import itertools

sentence = ''.join(c[0] for c in itertools.groupby(sentence))

Using regex:

import re

sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)
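For reference, here is a small sketch that runs both snippets on the example sentence. It only illustrates the problem: the two approaches give the same result, and both also collapse the words that should stay untouched. The names via_groupby and via_regex are made up for this illustration.

```python
import itertools
import re

sentence = 'hello, join this meeting heere using thiis lllink'

# groupby yields one group per run of identical characters; keep one char per run
via_groupby = ''.join(c[0] for c in itertools.groupby(sentence))

# (.)\1{1,} matches any character followed by one or more copies of itself
via_regex = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)

print(via_groupby)  # helo, join this meting here using this link
print(via_regex)    # helo, join this meting here using this link
```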

I have a solution, but I think there must be a more compact and efficient one. My solution for now is:

import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']

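new_sentence = ''

for word in sentence.split():
    # Compare without surrounding punctuation so that 'hello,' matches 'hello'
    if word.strip(',.;:!?') not in keepWord:
        new_word = ''.join(c[0] for c in itertools.groupby(word))
        # Append to new_sentence (not sentence), otherwise earlier words are lost
        new_sentence = new_sentence + " " + new_word
    else:
        new_sentence = new_sentence + " " + word

# Drop the leading space added by the first iteration
new_sentence = new_sentence.strip()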

Any suggestions?

Wiktor Stribiżew · Accepted Answer

You may match the whole words from the keepWord list, and only replace sequences of two or more identical letters in other contexts:

import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link

The regex will look like

\b(?:hello|meeting)\b|([^\W\d_])\1+

If Group 1 matches, its value is returned; otherwise, the full match (the word to keep) is put back unchanged.
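A slightly more defensive variant of the same idea could look like the sketch below. It is not part of the original answer: the function name collapse_repeats and its parameters keep_words and ignore_case are hypothetical. It assumes you may want to escape the keep words with re.escape (in case they contain regex metacharacters) and optionally match them case-insensitively.

```python
import re

def collapse_repeats(text, keep_words, ignore_case=False):
    """Collapse runs of a repeated letter, except inside the listed whole words."""
    pattern = r"([^\W\d_])\1+"
    if keep_words:
        # re.escape guards against keep words containing regex metacharacters
        keep = '|'.join(re.escape(w) for w in keep_words)
        pattern = rf"\b(?:{keep})\b|{pattern}"
    flags = re.IGNORECASE if ignore_case else 0
    # Group 1 is set only when a repeated-letter run matched; otherwise the
    # match is a keep word, which is put back unchanged.
    return re.sub(pattern, lambda m: m.group(1) or m.group(), text, flags=flags)

sentence = 'hello, join this meeting heere using thiis lllink'
print(collapse_repeats(sentence, ['hello', 'meeting']))
# hello, join this meeting here using this link
```

With ignore_case=True, a capitalized 'Hello' in the text would also be protected by a lowercase 'hello' in the list. Note that the flag also makes the backreference case-insensitive, so a run such as 'Aa' would be collapsed too.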

Pattern details

\b(?:hello|meeting)\b - hello or meeting enclosed with word boundaries
| - or
([^\W\d_]) - Group 1: any Unicode letter
\1+ - one or more backreferences to Group 1 value
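
To see which alternative fires for each match, the pattern can be run with re.finditer. This is only an illustrative sketch; the variable names are arbitrary. Group 1 is None for a kept word and holds the single letter for a repeated run.

```python
import re

pattern = re.compile(r"\b(?:hello|meeting)\b|([^\W\d_])\1+")
sentence = 'hello, join this meeting heere using thiis lllink'

# Collect (full match, Group 1) for every match in the sentence
matches = [(m.group(), m.group(1)) for m in pattern.finditer(sentence)]

for full, letter in matches:
    kind = 'kept word' if letter is None else f'repeated run -> {letter}'
    print(full, ':', kind)
# hello : kept word
# meeting : kept word
# ee : repeated run -> e
# ii : repeated run -> i
# lll : repeated run -> l
```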
