PoorGolfer
PoorGolfer

Reputation: 38

Struggling with Regex for adjacent letters differing by case

I am looking to be able to recursively remove adjacent letters in a string that differ only in their case e.g. if s = AaBbccDd i would want to be able to remove Aa Bb Dd but leave cc.

I can do this recursively using lists:

I think it aught to be able to be done using regex but i am struggling:

with test string 'fffAaaABbe' the answer should be 'fffe' but the regex I am using gives 'fe'

def test(line):
    res = re.compile(r'(.)\1{1}', re.IGNORECASE)
    #print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)

The way that works is:

def test(line):
    result =''
    chr = list(line)
    cnt = 0
    i = len(chr) - 1

    while i > 0:
        if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
            cnt += 1
            chr.pop(i)
            chr.pop(i - 1)
            i -= 2
        else:
            i -= 1
    if cnt > 0: # until we can't find any duplicates.
        return test(''.join(chr))
    result = ''.join(chr)
    print(result) 

Is it possible to do this using a regex?

Upvotes: 0

Views: 60

Answers (3)

CinchBlue
CinchBlue

Reputation: 6190

r'(.)\1{1}' is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.

However, even if we just do r'[A-z]\1{1}', this would still be bad because you would match any sequence of the same letter twice, but it would catch xx and XX -- you don't want to match consecutive same characters with matching case, as you said in the original question.

It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.

Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz:

import re
import string

def consecutiveLettersNonMatchingCase():
  # Get all 'xX|Xx' with a list comprehension
  # and join them with '|'
  return '|'.join(['{0}{1}|{1}{0}'.format(s, t)\
    # Iterate through the upper/lowercase characters
    # in lock-step
    for s, t in zip(
      string.ascii_lowercase,
      string.ascii_uppercase)])

def test(line):
    res = re.compile(consecutiveLettersNonMatchingCase())
    print(res.search(line))
    while res.search(line):
        line = res.sub('', line, 1)
    print(line)

print(consecutiveLettersNonMatchingCase())

Upvotes: 0

Daweo
Daweo

Reputation: 36520

re.IGNORECASE is not way to solve this problem, as it will treat aa, Aa, aA, AA same way. Technically it is possible using re.sub, following way.

import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub)  # fffe

Note that I explicitly defined all possible letters pairs, because so far I know there is no way to say "inverted case letter" using just re pattern. Maybe other user will be able to provide more concise re-based solution.

Upvotes: 1

Austin
Austin

Reputation: 26039

I suggest a different approach which uses groupby to group adjacent similar letters:

from itertools import groupby

def test(line):
    res = []
    for k, g in groupby(line, key=lambda x: x.lower()):
        g = list(g)
        if all(x == x.lower() for x in g):
            res.append(''.join(g))             
    print(''.join(res))

Sample run:

>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe

Upvotes: 0

Related Questions