Remove duplicated letters except in abbreviations

Question

I'd like to remove duplicated letters from a string, but only when the string also contains other letters. For instance, consider the following list:

aaa --> it is untouched because all are the same letters
aa  --> it is untouched because all are the same letters
a   --> not touched, just one letter
broom --> brom
school --> schol
boo --> should be bo
gool --> gol
ooow  --> should be ow

I use the following regex to remove the duplicates:

(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])

However, this fails on the string boo, which is left unchanged instead of having its double o removed. The same happens with oow, which is not reduced to ow.

Do you know why boo is not matched by the regex?

Wiktor Stribiżew · Accepted Answer

You can match and capture whole words consisting of identical chars into one capturing group, and then match repetitive consecutive letters in all other contexts, and replace accordingly:

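import re

text = "aaa, aa, a,broom, school...boo, gool, ooow."
# Group 1 keeps whole words made of a single repeated letter;
# Group 3 keeps one copy of any other repeated run.
print(re.sub(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+', r'\1\3', text))
# => aaa, aa, a,brom, schol...bo, gol, ow.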

See the Python demo and the regex demo.
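
As for why boo is skipped by the pattern in the question: the lookahead (?=[a-zA-Z]) requires a letter right after the repeated run, and the lookbehind (?<=[a-zA-Z]) requires a letter right before it, so a run at the very start or end of a word can never match in full. A small sketch that shows this (the helper name dedupe_original and the \1 replacement are my assumptions, since the question does not show the replacement string):

```python
import re

# The pattern from the question: a repeated letter that has a letter
# immediately before it AND immediately after it.
original = re.compile(r'(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])')

def dedupe_original(word):
    # Assumption: the run is replaced with a single copy of the letter (\1).
    return original.sub(r'\1', word)

for word in ["broom", "boo", "oow", "ooow"]:
    print(word, "->", dedupe_original(word))
# broom -> brom   (oo has r before it and m after it)
# boo -> boo      (nothing follows oo, so the lookahead fails)
# oow -> oow      (nothing precedes the first o, so the lookbehind fails)
# ooow -> oow     (only the 2nd and 3rd o can match; the 1st o is left over)
```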

Regex details

\b - a word boundary
(([a-zA-Z])\2+) - Group 1: an ASCII letter (captured into Group 2) and then one or more occurrences of the same letter
\b - a word boundary
| - or
([a-zA-Z]) - Group 3: an ASCII letter
\3+ - one or more occurrences of the letter captured in Group 3.

The replacement is a concatenation of Group 1 and Group 3 values.
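
Only one of the two groups takes part in any given match, so the concatenation always yields exactly one of them. The r'\1\3' template relies on Python 3.5+, where an unmatched group is replaced with an empty string; older versions raise an error there. If that matters, a callable replacement should behave the same way. This is just a sketch, and the names PATTERN, keep_group and dedupe are made up for illustration:

```python
import re

PATTERN = re.compile(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+')

def keep_group(match):
    # First branch matched: Group 1 holds the whole single-letter word, keep it as is.
    if match.group(1) is not None:
        return match.group(1)
    # Second branch matched: Group 3 holds one copy of the repeated letter.
    return match.group(3)

def dedupe(text):
    return PATTERN.sub(keep_group, text)

print(dedupe("aaa, aa, a,broom, school...boo, gool, ooow."))
# => aaa, aa, a,brom, schol...bo, gol, ow.
```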

To match any Unicode letters, replace [a-zA-Z] with [^\W\d_].
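
A quick sketch of that Unicode variant, assuming Python 3.5+ and str (not bytes) input, where \W, \d and \b are Unicode-aware by default. The name dedupe_unicode and the sample words are made up for illustration:

```python
import re

# Same pattern as above with [a-zA-Z] swapped for [^\W\d_]:
# any word character that is neither a digit nor an underscore.
UNICODE_PATTERN = re.compile(r'\b(([^\W\d_])\2+)\b|([^\W\d_])\3+')

def dedupe_unicode(text):
    return UNICODE_PATTERN.sub(r'\1\3', text)

print(dedupe_unicode("øøø, brøøm, skøøl"))
# => øøø, brøm, skøl
```

Repeated digits and underscores are left alone, since neither can be matched by [^\W\d_].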
