Aisha

Reputation: 103

Replace all consecutive repeated letters ignoring specific words

I have seen plenty of suggestions for removing consecutively repeated letters in a sentence, using either re (regex) or ''.join in Python, but I want to make an exception for specific words.

E.g.:

I want this sentence:

sentence = 'hello, join this meeting heere using thiis lllink'

to become:

'hello, join this meeting here using this link'

given this list of words that should be kept as they are and skipped by the repeated-letter check: keepWord = ['Hello','meeting']

The two scripts I found useful are:

I have a solution, but I think there is a more compact and efficient one. My solution for now is:

import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']

new_sentence = ''

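for word in sentence.split():
    # note: split() keeps punctuation attached, so 'hello,' is not found in keepWord
    if word not in keepWord:
        # groupby yields one group per run of identical characters; keep one character per run
        new_word = ''.join(c[0] for c in itertools.groupby(word))
        new_sentence = new_sentence + " " + new_word
    else:
        new_sentence = new_sentence + " " + word

new_sentence = new_sentence.strip()  # drop the leading space added by the loop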

Any suggestions?

Upvotes: 5

Views: 306

Answers (2)

alani

Reputation: 13079

Although not especially compact, here is a reasonably simple example using a regular expression: the function subst replaces each run of repeated characters with a single one, and re.sub calls it for every word that it finds.

It is assumed here that, because your example keepWord list (where first mentioned) has Hello in title case but the text has hello in lower case, you want a case-insensitive comparison against the list. So it works equally well whether your sentence contains Hello or hello.

import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['Hello','meeting']

keepWord_s = set(word.lower() for word in keepWord)

def subst(match):
    word = match.group(0)
    return word if word.lower() in keepWord_s else re.sub(r'(.)\1+', r'\1', word)

print(re.sub(r'\b.+?\b', subst, sentence))

Gives:

hello, join this meeting here using this link
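The same idea can be wrapped in a reusable function. The following is only a sketch under a few assumptions: the name `collapse_repeats` is hypothetical, and it matches words with `\w+` rather than `\b.+?\b`, so the punctuation and whitespace between words are never passed to the callback.

```python
import re

def collapse_repeats(sentence, keep_words):
    """Collapse runs of a repeated character in every word not listed in keep_words."""
    # Lower-case the keep list once so the lookup is case-insensitive.
    keep = {w.lower() for w in keep_words}

    def subst(match):
        word = match.group(0)
        if word.lower() in keep:
            return word  # protected word: leave it untouched
        # (.)\1+ matches a character followed by one or more copies of itself
        return re.sub(r'(.)\1+', r'\1', word)

    # \w+ matches only the words, so separators are copied through unchanged.
    return re.sub(r'\w+', subst, sentence)

print(collapse_repeats('Hello, join this meeting heere using thiis lllink',
                       ['Hello', 'meeting']))
```

Be aware that `\w` also matches digits and underscores, so with this sketch a token such as `1000` would be collapsed to `10`.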

Upvotes: 1

Wiktor Stribiżew

Reputation: 627292

You may match the whole words from the keepWord list, and only replace sequences of two or more identical letters in other contexts:

import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link


The regex will look like

\b(?:hello|meeting)\b|([^\W\d_])\1+

If Group 1 matches, its value (a single letter) is returned as the replacement; otherwise the full match (the word to keep) is put back unchanged.
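To see why the callback can tell the two cases apart, it may help to list what each match captures. This is a small illustrative trace on a shortened, made-up input, not part of the original answer:

```python
import re

pattern = re.compile(r"\b(?:hello|meeting)\b|([^\W\d_])\1+")

# For each match, record the full match and the content of Group 1.
# Group 1 is None whenever the keep-word alternative matched.
matches = [(m.group(), m.group(1))
           for m in pattern.finditer('hello meeting heere lllink')]

for full, letter in matches:
    print(repr(full), repr(letter))
# 'hello' None
# 'meeting' None
# 'ee' 'e'
# 'lll' 'l'
```

Because the keep-word alternative comes first and consumes the whole word, the `ll` inside `hello` and the `ee` inside `meeting` are never offered to the second alternative.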

Pattern details

  • \b(?:hello|meeting)\b - hello or meeting enclosed with word boundaries
  • | - or
  • ([^\W\d_]) - Group 1: any Unicode letter
  • \1+ - one or more backreferences to Group 1 value
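If the keep list may contain regex metacharacters or differ in case from the text, the pattern can be built a little more defensively. The sketch below is an assumption-based variant, not the answer's own code: the function name `collapse_except` is hypothetical, `re.escape` is applied to each keep word, and `re.IGNORECASE` is added so that `Hello` is protected by `hello`.

```python
import re

def collapse_except(sentence, keep_words):
    """Collapse repeated letters except inside whole words listed in keep_words."""
    letters = r"([^\W\d_])\1+"  # Group 1: a letter, followed by one or more copies of it
    if keep_words:
        # re.escape guards against keep words that contain regex metacharacters
        alternation = '|'.join(re.escape(w) for w in keep_words)
        pattern = rf"\b(?:{alternation})\b|{letters}"
    else:
        # With no keep words, only the repeated-letter branch is needed
        pattern = letters
    # Group 1 is set only for a repeated-letter match; otherwise put the keep word back
    return re.sub(pattern, lambda m: m.group(1) or m.group(), sentence,
                  flags=re.IGNORECASE)

print(collapse_except('Hello, join this meeting heere using thiis lllink',
                      ['hello', 'meeting']))
```

One trade-off of this variant: `re.IGNORECASE` also makes the backreference case-insensitive, so a name such as `Aaron` would be shortened too unless it is added to the keep list.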

Upvotes: 1
