JarochoEngineer

Reputation: 1787

Remove duplicated letters except in abbreviations

I'd like to collapse consecutive duplicated letters in a string, but only when the string also contains other letters. For instance, consider the following list:

aaa --> untouched, because all the letters are the same
aa  --> untouched, because all the letters are the same
a   --> untouched, just one letter
broom --> brom
school --> schol
boo --> should be bo
gool --> gol
ooow  --> should be ow

I use the following regex to get rid of the duplicates:

(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])

However, this fails on the string boo, which is left unchanged instead of having its double o collapsed. The same happens with oow, which is not reduced to ow.

Do you know why boo is not matched by the regex?

Upvotes: 1

Views: 103

Answers (2)

Jolbas

Reputation: 752

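Your regular expression doesn't match boo because the lookbehind and the lookahead require a letter both immediately before and immediately after the repeated letters. In boo the double o sits at the end of the string, so the lookahead fails; in oow it sits at the start, so the lookbehind fails.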
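To see the effect of the two lookarounds, here is a small sketch that runs the pattern from the question on three of its sample words. The replacement r'\1' and the helper name collapse_inner_runs are assumptions, since the question does not show how the substitution is called.

```python
import re

# Pattern copied from the question.
PATTERN = r'(?<=[a-zA-Z])([a-zA-Z])\1+(?=[a-zA-Z])'

def collapse_inner_runs(word):
    # Assumed replacement: keep one copy of the repeated letter.
    return re.sub(PATTERN, r'\1', word)

for word in ['broom', 'boo', 'oow']:
    print(word, '-->', collapse_inner_runs(word))
# broom --> brom
# boo --> boo
# oow --> oow
```

Only broom has a letter on both sides of the repeated o, so it is the only word that changes.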

One possibility is to use a simpler regex that collapses all duplicates, and then revert to the original string if the result is a single character:

import re

def remove_duplicate(string):
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    # A one-character result means the input was a single repeated letter: keep it.
    return new_string if len(new_string) > 1 else string
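Note that this function treats its whole argument as one word. If the input is a longer text, one way to apply it word by word is sketched below. The helper name remove_duplicate_in_text is hypothetical, and the sketch assumes that a word is a run of ASCII letters.

```python
import re

def remove_duplicate(string):
    new_string = re.sub(r'([a-zA-Z])\1+', r'\1', string)
    return new_string if len(new_string) > 1 else string

def remove_duplicate_in_text(text):
    # Hypothetical helper: run remove_duplicate on every run of ASCII letters
    # and leave everything between the words untouched.
    return re.sub(r'[a-zA-Z]+', lambda m: remove_duplicate(m.group(0)), text)

print(remove_duplicate_in_text("aaa, aa, a,broom, school...boo, gool, ooow."))
# => aaa, aa, a,brom, schol...bo, gol, ow.
```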

Here is a possible solution without regular expressions. It's faster, but it also collapses repeated whitespace and punctuation, not only letters.

def remove_duplicate(string):
    new_string = ''
    last_c = None
    for c in string:
        if c == last_c:
            continue
        else:
            new_string += c
            last_c = c
    if len(new_string) > 1:
        return new_string
    else:
        return string
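If collapsing whitespace and punctuation is not wanted, a regex-free variant can group the string into runs and collapse only the runs of letters. This is a sketch rather than a tested drop-in: the function name remove_duplicate_letters is made up for the example, and it assumes that "letter" means an ASCII letter, as in the question's [a-zA-Z].

```python
from itertools import groupby

def remove_duplicate_letters(string):
    parts = []
    for char, run in groupby(string):
        run_length = len(list(run))
        if char.isascii() and char.isalpha():
            # Run of one ASCII letter: keep a single copy.
            parts.append(char)
        else:
            # Whitespace, punctuation, digits, etc.: keep the run as it is.
            parts.append(char * run_length)
    new_string = ''.join(parts)
    # Same fallback as above: a one-character result means the input
    # was a single repeated letter, so return it unchanged.
    return new_string if len(new_string) > 1 else string

print(remove_duplicate_letters("boo  !!"))
# => bo  !!
```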

Upvotes: 1

Wiktor Stribiżew

Reputation: 627292

You can match and capture whole words consisting of identical chars into one capturing group, and then match repetitive consecutive letters in all other contexts, and replace accordingly:

import re
text = "aaa, aa, a,broom, school...boo, gool, ooow."
print( re.sub(r'\b(([a-zA-Z])\2+)\b|([a-zA-Z])\3+', r'\1\3', text) )
# => aaa, aa, a,brom, schol...bo, gol, ow.

See the Python demo and the regex demo.

Regex details

  • \b - a word boundary
  • (([a-zA-Z])\2+) - Group 1: an ASCII letter (captured into Group 2) and then one or more occurrences of the same letter
  • \b - a word boundary
  • | - or
  • ([a-zA-Z]) - Group 3: an ASCII letter captured into Group 3
  • \3+ - one or more occurrences of the letter captured in Group 3.

The replacement is a concatenation of Group 1 and Group 3 values.
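The same idea can be spelled out with named groups, re.VERBOSE and a replacement function, which may be easier to read than the numbered version. This is only an equivalent sketch: the group names same, c1 and c2 and the function names are chosen for the example.

```python
import re

PATTERN = re.compile(r"""
    \b(?P<same>(?P<c1>[a-zA-Z])(?P=c1)+)\b   # a whole word made of one repeated letter
    |                                        # or
    (?P<c2>[a-zA-Z])(?P=c2)+                 # a repeated letter in any other context
""", re.VERBOSE)

def collapse(match):
    # If the first alternative matched, put the whole word back unchanged;
    # otherwise keep a single copy of the repeated letter.
    return match.group("same") or match.group("c2")

def remove_duplicates(text):
    return PATTERN.sub(collapse, text)

print(remove_duplicates("aaa, aa, a,broom, school...boo, gool, ooow."))
# => aaa, aa, a,brom, schol...bo, gol, ow.
```

Only one of the two alternatives can match at a time, so the unmatched group is empty in the \1\3 replacement (and None in the callback).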

To match any Unicode letters, replace [a-zA-Z] with [^\W\d_].
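For illustration, here is the pattern with that substitution applied. The sample text and the function name remove_duplicates_unicode are made up for the example; the sketch relies on Python 3 str patterns, where \W and \d are Unicode-aware by default.

```python
import re

# [^\W\d_] = any word character that is neither a digit nor an underscore.
UNICODE_PATTERN = r'\b(([^\W\d_])\2+)\b|([^\W\d_])\3+'

def remove_duplicates_unicode(text):
    return re.sub(UNICODE_PATTERN, r'\1\3', text)

print(remove_duplicates_unicode("ööö, bröööt, boo"))
# => ööö, bröt, bo
```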

Upvotes: 1
