Reputation: 8470
For example, I have the string:
aacbbbqq
As a result, I want to get the following matches:
(aa, c, bbb, qq)
I know that I can write something like this:
([a]+)|([b]+)|([c]+)|...
But I think it's ugly, and I'm looking for a better solution. I'm looking for a regular expression solution, not a self-written finite-state machine.
Upvotes: 38
Views: 40743
Reputation: 1
This raw solution may be useful (note that it only collects runs of two or more identical characters):
string = "helllllo worlddd hhiii "
i = 0
j = 1
b = ''
l = []
for a in range(len(string)-1):
    if string[i] != string[j]:
        j = j+1
        i = j-1
        if b:
            l.append(b)
            b = ''
    elif string[i] == string[j]:
        if j-i == 1:
            b += string[i:j+1]
        else:
            b += string[i]
        j = j+1
print(l)
output:
['lllll', 'ddd', 'hh', 'iii']
Upvotes: 0
Reputation: 21
You can try something like this:
import re
string = 'aacbbbqq'
result = re.findall(r'((\w)\2*)', string)  # greedy \2*: a lazy \2*? would match one character at a time
output = [x[0] for x in result]
print(output)
The output will be:
['aa', 'c', 'bbb', 'qq']
Upvotes: 2
Reputation: 2614
To collapse each run of a repeated character into a single character, you can use:
re.sub(r"(\w)\1*", r'\1', 'tessst')
The output would be:
'test'
Upvotes: 4
Reputation: 1578
The findall method will work if you capture the back-reference like so:
result = [match[1] + match[0] for match in re.findall(r"(.)(\1*)", string)]
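To see why the concatenation is needed, here is a small sketch (the names `pairs` and `runs` are only illustrative). With two capturing groups, re.findall returns one tuple per match: the first character and the rest of its run.

```python
import re

string = 'aacbbbqq'

# Each match yields a tuple: (first character, remaining repetitions).
pairs = re.findall(r'(.)(\1*)', string)

# Joining both parts of each tuple rebuilds the full run.
runs = [first + rest for first, rest in pairs]

print(pairs)  # [('a', 'a'), ('c', ''), ('b', 'bb'), ('q', 'q')]
print(runs)   # ['aa', 'c', 'bbb', 'qq']
```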
Upvotes: 5
Reputation: 95901
The trick is to match a single char of the range you want, and then make sure you match all repetitions of the same character:
>>> matcher= re.compile(r'(.)\1*')
This matches any single character (.) and then its repetitions (\1*), if any.
For your input string, you can get the desired output as:
>>> [match.group() for match in matcher.finditer('aacbbbqq')]
['aa', 'c', 'bbb', 'qq']
NB: because of the capturing group, re.findall won't work correctly: it returns the contents of the group rather than the whole match.
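A minimal sketch of the difference, assuming a hypothetical helper named `runs` (not part of the re module):

```python
import re

def runs(text, pattern=r'(.)\1*'):
    """Return every maximal run of a repeated character as a whole-match string."""
    return [m.group() for m in re.finditer(pattern, text)]

full_matches = runs('aacbbbqq')

# With exactly one capturing group, findall returns only that group's text.
groups_only = re.findall(r'(.)\1*', 'aacbbbqq')

print(full_matches)  # ['aa', 'c', 'bbb', 'qq']
print(groups_only)   # ['a', 'c', 'b', 'q']
```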
In case you don't want to match any character, change the . in the regular expression accordingly:
>>> matcher= re.compile(r'([a-z])\1*') # only lower case ASCII letters
>>> matcher= re.compile(r'(?i)([a-z])\1*') # only ASCII letters
>>> matcher= re.compile(r'(\w)\1*') # ASCII letters or digits or underscores (Python 2.x; in Python 3.x str patterns are Unicode-aware by default, pass re.ASCII to restrict)
>>> matcher= re.compile(r'(?u)(\w)\1*') # against unicode values, any letter or digit known to Unicode, or underscore
Check the latter against u'hello²²' (Python 2.x) or 'hello²²' (Python 3.x):
>>> text= u'hello=\xb2\xb2'
>>> print('\n'.join(match.group() for match in matcher.finditer(text)))
h
e
ll
o
²²
The behaviour of \w against non-Unicode strings / byte arrays may change if you have first issued a locale.setlocale call and the pattern is compiled with the re.LOCALE flag.
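As a rough illustration for Python 3.x only, where str patterns are Unicode-aware by default and re.ASCII narrows \w back to [a-zA-Z0-9_] (the variable names are illustrative):

```python
import re

text = 'hello=\xb2\xb2'  # 'hello=²²'

# Default for str patterns in Python 3: \w also matches '²'.
unicode_runs = [m.group() for m in re.finditer(r'(\w)\1*', text)]

# re.ASCII limits \w to ASCII letters, digits and underscore.
ascii_runs = [m.group() for m in re.finditer(r'(\w)\1*', text, re.ASCII)]

print(unicode_runs)  # ['h', 'e', 'll', 'o', '²²']
print(ascii_runs)    # ['h', 'e', 'll', 'o']
```

In both cases the = is skipped because it is not a word character.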
Upvotes: 26
Reputation: 9415
This will work, see a working example here: http://www.rubular.com/r/ptdPuz0qDV
(\w)\1*
Upvotes: 8
Reputation: 31951
itertools.groupby is not a RegExp, but it's not self-written either. :-) A quote from the Python docs:
# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
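Applied to the string from the question, this could look like the sketch below; `group_runs` is a hypothetical helper name, and each group is joined back into a string because groupby yields iterators of characters.

```python
from itertools import groupby

def group_runs(text):
    """Split text into maximal runs of consecutive identical characters."""
    # groupby yields (key, iterator) pairs; join each iterator into a string.
    return [''.join(group) for _key, group in groupby(text)]

print(group_runs('aacbbbqq'))  # ['aa', 'c', 'bbb', 'qq']
```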
Upvotes: 26