Reputation: 1787
I'd like to collapse duplicated letters in a string, but only when the string also contains other letters. For instance, consider the following list:
aaa --> it is untouched because all are the same letters
aa --> it is untouched because all are the same letters
a --> not touched, just one letter
broom --> brom
school --> schol
boo --> should be bo
gool --> gol
ooow --> should be ow
I use the following regex to get rid of the duplicates:
(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])
However, this fails on the string boo, which is kept as the original boo instead of having the double o removed. The same happens with oow, which is not reduced to ow.
Do you know why boo is not matched by the regex?
Upvotes: 1
Views: 103
Reputation: 752
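Your regular expression doesn't match boo because its lookbehind and lookahead require a letter both immediately before and immediately after the duplicated run. In boo the double o sits at the end of the string, so the lookahead (?=[a-zA-Z]) fails.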
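To see the effect of the two lookarounds, here is a small check that runs the regex from the question on a few of the sample words. It is only an illustration; the variable name pattern is mine, not from the question.

```python
import re

# The regex from the question: a repeated letter with a letter on both sides.
pattern = re.compile(r'(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])')

for word in ["broom", "boo", "oow", "ooow"]:
    print(word, "-->", pattern.sub(r'\1', word))
# broom --> brom
# boo --> boo
# oow --> oow
# ooow --> oow
```

In ooow the first o can never be part of a match because nothing precedes it, so only the second and third o are collapsed.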
One possibility is to use a simpler regex that collapses all duplicates, and then fall back to the original string if the result is a single character:
import re

def remove_duplicate(string):
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    # Keep the original if only a single character would remain
    return new_string if len(new_string) > 1 else string
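A quick way to check this against the words from the question is sketched below. The function is restated so the snippet runs on its own. remove_duplicates_in_text is a hypothetical helper of mine, assuming you want to apply the same rule to each word inside a longer text rather than to the string as a whole.

```python
import re

def remove_duplicate(string):
    # Collapse every run of the same ASCII letter to a single letter.
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    # A one-character result means the input was a single repeated letter.
    return new_string if len(new_string) > 1 else string

def remove_duplicates_in_text(text):
    # Hypothetical helper: apply remove_duplicate to each run of ASCII letters.
    return re.sub(r'[a-zA-Z]+', lambda m: remove_duplicate(m.group(0)), text)

for word in ["aaa", "aa", "a", "broom", "school", "boo", "gool", "ooow"]:
    print(word, "-->", remove_duplicate(word))

print(remove_duplicates_in_text("aaa, aa, a,broom, school...boo, gool, ooow."))
# => aaa, aa, a,brom, schol...bo, gol, ow.
```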
Here is a possible solution without regular expressions. It's faster, but it will also collapse repeated whitespace and punctuation, not only letters.
def remove_duplicate(string):
    new_string = ''
    last_c = None
    for c in string:
        if c == last_c:
            continue
        else:
            new_string += c
            last_c = c
    if len(new_string) > 1:
        return new_string
    else:
        return string
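If collapsing whitespace and punctuation is a problem, one possible variation is sketched below. It is not part of the solution above: remove_duplicate_letters is a hypothetical name, and it assumes that "letter" means whatever str.isalpha() accepts, which includes non-ASCII letters.

```python
from itertools import groupby

def remove_duplicate_letters(string):
    # Hypothetical variant: collapse runs of letters, keep other runs intact.
    parts = []
    for char, run in groupby(string):
        run = ''.join(run)
        # str.isalpha() is True for any Unicode letter, not just a-z and A-Z.
        parts.append(char if char.isalpha() else run)
    new_string = ''.join(parts)
    # Same fallback as before: keep the input if one character would remain.
    return new_string if len(new_string) > 1 else string

print(remove_duplicate_letters("boo!!  ok"))
# => bo!!  ok
```

As in the loop version, the single-character fallback looks at the whole string, not at individual words.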
Upvotes: 1
Reputation: 627292
You can match and capture whole words consisting of identical chars into one capturing group, and then match repetitive consecutive letters in all other contexts, and replace accordingly:
import re
text = "aaa, aa, a,broom, school...boo, gool, ooow."
print( re.sub(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+', r'\1\3', text) )
# => aaa, aa, a,brom, schol...bo, gol, ow.
See the Python demo and the regex demo.
Regex details
\b - a word boundary
(([a-zA-Z])\2+) - Group 1: an ASCII letter (captured into Group 2) and then one or more occurrences of the same letter
\b - a word boundary
| - or
([a-zA-Z]) - Group 3: an ASCII letter
\3+ - one or more occurrences of the letter captured in Group 3.
The replacement is a concatenation of Group 1 and Group 3 values.
To match any Unicode letters, replace [a-zA-Z] with [^\W\d_].
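As a rough sketch of that Unicode variant, assuming Python 3.5 or later (where groups that did not take part in the match are replaced with an empty string) and using sample words of my own:

```python
import re

# Same pattern as above, with [a-zA-Z] swapped for [^\W\d_].
UNICODE_PATTERN = re.compile(r'\b(([^\W\d_])\2+)\b|([^\W\d_])\3+')

def collapse_repeats(text):
    # Hypothetical helper name; it applies the same \1\3 replacement as above.
    return UNICODE_PATTERN.sub(r'\1\3', text)

print(collapse_repeats("ééé, Bäääm, broom, ooow."))
# => ééé, Bäm, brom, ow.
```

This relies on str patterns being Unicode-aware by default in Python 3, so \W, \d and \b all take non-ASCII letters into account.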
Upvotes: 1