Reputation: 38
I am looking to be able to recursively remove adjacent letters in a string that differ only in their case e.g. if s = AaBbccDd i would want to be able to remove Aa Bb Dd but leave cc.
I can do this recursively using lists:
I think it aught to be able to be done using regex but i am struggling:
with test string 'fffAaaABbe' the answer should be 'fffe' but the regex I am using gives 'fe'
def test(line):
res = re.compile(r'(.)\1{1}', re.IGNORECASE)
#print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
The way that works is:
def test(line):
result =''
chr = list(line)
cnt = 0
i = len(chr) - 1
while i > 0:
if ord(chr[i]) == ord(chr[i - 1]) + 32 or ord(chr[i]) == ord(chr[i - 1]) - 32:
cnt += 1
chr.pop(i)
chr.pop(i - 1)
i -= 2
else:
i -= 1
if cnt > 0: # until we can't find any duplicates.
return test(''.join(chr))
result = ''.join(chr)
print(result)
Is it possible to do this using a regex?
Upvotes: 0
Views: 60
Reputation: 6190
r'(.)\1{1}'
is wrong because it will match any character that is repeated twice, including non-letter characters. If you want to stick to letters, you can't use this.
However, even if we just do r'[A-z]\1{1}'
, this would still be bad because you would match any sequence of the same letter twice, but it would catch xx
and XX
-- you don't want to match consecutive same characters with matching case, as you said in the original question.
It just so happens that there is no short-hand to do this conveniently, but it is still possible. You could also just write a small function to turn it into a short-hand.
Building on @Daweo's answer, you can generate the regex pattern needed to match pairs of same letters with non-matching case to get the final pattern of aA|Aa|bB|Bb|cC|Cc|dD|Dd|eE|Ee|fF|Ff|gG|Gg|hH|Hh|iI|Ii|jJ|Jj|kK|Kk|lL|Ll|mM|Mm|nN|Nn|oO|Oo|pP|Pp|qQ|Qq|rR|Rr|sS|Ss|tT|Tt|uU|Uu|vV|Vv|wW|Ww|xX|Xx|yY|Yy|zZ|Zz
:
import re
import string
def consecutiveLettersNonMatchingCase():
# Get all 'xX|Xx' with a list comprehension
# and join them with '|'
return '|'.join(['{0}{1}|{1}{0}'.format(s, t)\
# Iterate through the upper/lowercase characters
# in lock-step
for s, t in zip(
string.ascii_lowercase,
string.ascii_uppercase)])
def test(line):
res = re.compile(consecutiveLettersNonMatchingCase())
print(res.search(line))
while res.search(line):
line = res.sub('', line, 1)
print(line)
print(consecutiveLettersNonMatchingCase())
Upvotes: 0
Reputation: 36520
re.IGNORECASE
is not way to solve this problem, as it will treat aa
, Aa
, aA
, AA
same way. Technically it is possible using re.sub
, following way.
import re
txt = 'fffAaaABbe'
after_sub = re.sub(r'Aa|aA|Bb|bB|Cc|cC|Dd|dD|Ee|eE|Ff|fF|Gg|gG|Hh|hH|Ii|iI|Jj|jJ|Kk|kK|Ll|lL|Mm|mM|Nn|nN|Oo|oO|Pp|pP|Qq|qQ|Rr|rR|Ss|sS|Tt|tT|Uu|uU|Vv|vV|Ww|wW|Xx|xX|Yy|yY|Zz|zZ', '', txt)
print(after_sub) # fffe
Note that I explicitly defined all possible letters pairs, because so far I know there is no way to say "inverted case letter" using just re
pattern. Maybe other user will be able to provide more concise re
-based solution.
Upvotes: 1
Reputation: 26039
I suggest a different approach which uses groupby
to group adjacent similar letters:
from itertools import groupby
def test(line):
res = []
for k, g in groupby(line, key=lambda x: x.lower()):
g = list(g)
if all(x == x.lower() for x in g):
res.append(''.join(g))
print(''.join(res))
Sample run:
>>> test('AaBbccDd')
cc
>>> test('fffAaaABbe')
fffe
Upvotes: 0