Reputation: 103
I have seen plenty of suggestions for removing consecutively repeated letters in a sentence, using either re (regex) or .join in Python, but I want to make an exception for certain words.
E.g.:
I want this sentence > sentence = 'hello, join this meeting heere using thiis lllink'
to be like this > 'hello, join this meeting here using this link'
given that I have this list of words to keep, which should skip the repeated-letter check: keepWord = ['Hello','meeting']
The two scripts I found useful are:
Using .join:
import itertools
sentence = ''.join(c[0] for c in itertools.groupby(sentence))
Using regex:
import re
sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)
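To make the problem concrete, here is a small sketch that runs both approaches on the example sentence. The variable names via_groupby and via_regex are only illustrative. Both collapse every run of identical characters, so the words that should be kept get damaged too:

```python
import itertools
import re

sentence = 'hello, join this meeting heere using thiis lllink'

# Collapse every run of identical characters using itertools.groupby.
via_groupby = ''.join(key for key, _ in itertools.groupby(sentence))

# Same idea with a backreference: a character followed by one or more copies of itself.
via_regex = re.sub(r'(.)\1+', r'\1', sentence)

print(via_groupby)  # helo, join this meting here using this link
print(via_regex)    # helo, join this meting here using this link
```

Note that 'hello' becomes 'helo' and 'meeting' becomes 'meting', which is exactly what the keep list is meant to prevent.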
I have a solution, but I think there is a more compact and efficient one. My solution for now is:
import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = ''
for word in sentence.split():
    if word not in keepWord:
        # Collapse runs of repeated letters in words that are not on the keep list
        new_word = ''.join(c[0] for c in itertools.groupby(word))
        new_sentence = new_sentence + " " + new_word
    else:
        # Keep listed words unchanged
        new_sentence = new_sentence + " " + word
Any suggestions?
Upvotes: 5
Views: 306
Reputation: 13079
Although not especially compact, here is a reasonably simple example using regexp: the function subst will replace repeated characters with a single one, and then re.sub is used to call this for each word that it finds.
It is assumed here that, because your example keepWord list (where first mentioned) has Hello in title case but the text has hello in lower case, you want to perform a case-insensitive comparison against the list. So it will work equally whether your sentence contains Hello or hello.
import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['Hello','meeting']
keepWord_s = set(word.lower() for word in keepWord)

def subst(match):
    word = match.group(0)
    return word if word.lower() in keepWord_s else re.sub(r'(.)\1+', r'\1', word)

print(re.sub(r'\b.+?\b', subst, sentence))
Gives:
hello, join this meeting here using this link
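As a quick check of the case-insensitive behaviour, the same logic can be wrapped in a function. This is only a sketch: the name dedupe and the mixed-case input are made up for illustration and are not part of the code above.

```python
import re

def dedupe(sentence, keep_words):
    # Lower-case the keep list once so lookups ignore case.
    keep = {w.lower() for w in keep_words}

    def subst(match):
        word = match.group(0)
        # Words on the keep list are returned untouched; everything else
        # has runs of a repeated character collapsed to one character.
        return word if word.lower() in keep else re.sub(r'(.)\1+', r'\1', word)

    # \b.+?\b matches each chunk between two word boundaries, so words and
    # the separators between them are passed to subst one at a time.
    return re.sub(r'\b.+?\b', subst, sentence)

result = dedupe('Hello, join this MEETING heere', ['Hello', 'meeting'])
print(result)  # Hello, join this MEETING here
```

The kept words retain their original casing because only the lookup is lower-cased, not the text that is returned.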
Upvotes: 1
Reputation: 627292
You may match the whole words from the keepWord list, and only replace sequences of two or more identical letters in other contexts:
import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link
See the Python demo
The regex will look like
\b(?:hello|meeting)\b|([^\W\d_])\1+
See the regex demo. If Group 1 matches, its value is returned, else, the full match (the word to keep) is put back.
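To see which branch of the alternation fires for each match, the matches can be listed one by one. This is a sketch rather than part of the original answer: the trace list is illustrative, and re.escape is an added precaution (an assumption on my side) for keep words that might contain regex metacharacters.

```python
import re

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello', 'meeting']

# Escape each keep word so it is matched literally inside the alternation.
pattern = re.compile(fr"\b(?:{'|'.join(map(re.escape, keepWord))})\b|([^\W\d_])\1+")

# For every match, record the matched text and what it would be replaced with:
# Group 1 (a single letter) if the repeated-letter branch matched,
# otherwise the full match (a keep word) is put back unchanged.
trace = [(m.group(), m.group(1) or m.group()) for m in pattern.finditer(sentence)]

for matched, replacement in trace:
    print(f'{matched!r} -> {replacement!r}')
# 'hello' -> 'hello'
# 'meeting' -> 'meeting'
# 'ee' -> 'e'
# 'ii' -> 'i'
# 'lll' -> 'l'
```

Because the keep-word branch comes first and consumes the whole word, the repeated-letter branch never gets a chance to look inside 'hello' or 'meeting'.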
Pattern details
\b(?:hello|meeting)\b - hello or meeting enclosed with word boundaries
| - or
([^\W\d_]) - Group 1: any Unicode letter
\1+ - one or more backreferences to Group 1 value
Upvotes: 1