Reputation: 1235
I have a string:
string = u'11a2ee22b333c44d5e66e777e8888'
I want to find all chunks of k consecutive digits where n <= k <= m, using a regular expression only.
Say for example n=2 and m=3, using (?:\D|^)(\d{2,3})(?:\D|$):
re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
Gives this output:
['11', '333', '66']
Desired output:
['11', '22', '333', '44', '66', '777']
I know there are alternate solutions like:
filter(lambda x: re.match('^\d{2,3}$', x), re.split(u'\D',r'11a2ee22b333c44d5e66e777e8888'))
which gives the desired output, but I want to know what's wrong with the first approach?
It seems re.findall goes in sequence and skips the previous part when matched, so what can be done?
Upvotes: 1
Views: 660
Reputation: 43169
You could even generalize it with a function:
import re
string = "11a2ee22b333c44d5e66e777e8888"
def numbers(n,m):
    rx = re.compile(r'(?<!\d)(\d{' + '{},{}'.format(n,m) + '})(?!\d)')
    return rx.findall(string)
print(numbers(2,3))
# ['11', '22', '333', '44', '66', '777']
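If you would rather not depend on the module-level string, here is a minimal sketch of the same idea that takes the text as an argument. The name find_digit_runs is made up for illustration; the pattern is the same negative lookbehind/lookahead pair.

```python
import re

def find_digit_runs(text, n, m):
    """Return every maximal run of digits in text whose length is between n and m."""
    # (?<!\d) and (?!\d) require that no digit touches the run on either side.
    pattern = re.compile(r'(?<!\d)\d{%d,%d}(?!\d)' % (n, m))
    return pattern.findall(text)

text = "11a2ee22b333c44d5e66e777e8888"
print(find_digit_runs(text, 2, 3))
print(find_digit_runs(text, 4, 4))
```

Passing the text in makes the helper reusable for other inputs and easier to test.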
Upvotes: 1
Reputation: 4554
Use a lookahead: \d{2,3} matches 2 or 3 digits, and (?=[a-z]) requires a lowercase letter right after the digits without consuming it.
In [136]: re.findall(r'(\d{2,3})(?=[a-z])',string)
Out[136]: ['11', '22', '333', '44', '66', '777']
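One thing worth checking before relying on this pattern: it has no condition on what comes before the digits, and it needs a letter after them. The sketch below uses a made-up input (sample is not from the question) to compare it with a pattern that checks both sides.

```python
import re

lookahead_only = re.compile(r'(\d{2,3})(?=[a-z])')
both_sides = re.compile(r'(?<!\d)\d{2,3}(?!\d)')

# Hypothetical input: a run of four digits, then a two-digit run at the very end.
sample = 'a8888b22'

tail_match = lookahead_only.findall(sample)
strict_match = both_sides.findall(sample)

print(tail_match)    # ['888']: the tail of the four-digit run; the final '22' is missed
print(strict_match)  # ['22']: only the run that is really 2 to 3 digits long
```

For the string in the question both patterns return the same list, because every 2 or 3 digit run there is followed by a letter and the only longer run sits at the end.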
Upvotes: 1
Reputation: 59118
Note: The result you show in your question is not what I'm getting:
>>> import re
>>> re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'44', u'66']
It's still missing some of the matches you want, but not the same ones.
The problem is that even though non-capturing groups like (?:\D|^) and (?:\D|$) don't capture what they match, they still consume it.
This means that the match which yields '22' has actually consumed:
1. e, with (?:\D|^): not captured (but still consumed)
2. 22, with (\d{2,3}): captured
3. b, with (?:\D|$): not captured (but still consumed)
… so that b is no longer available to be matched before 333.
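To see the consumption directly, here is a small sketch using re.finditer, which exposes the whole match (group 0) and its span alongside the captured digits. The variable names are just for illustration.

```python
import re

text = u'11a2ee22b333c44d5e66e777e8888'
pattern = re.compile(r'(?:\D|^)(\d{2,3})(?:\D|$)')

# group(0) is everything the match consumed; group(1) is only the captured digits.
consumed = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer(text)]
for whole, digits, span in consumed:
    print(whole, digits, span)
# 11a 11 (0, 3)
# e22b 22 (5, 9)
# c44d 44 (12, 16)
# e66e 66 (17, 21)
```

The second match ends at index 9, so the next search starts on the first 3 of 333 with no non-digit left in front of it.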
You can get the result you want with lookbehind and lookahead syntax:
>>> re.findall(u'(?<!\d)\d{2,3}(?!\d)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'333', u'44', u'66', u'777']
Here, (?<!\d) is a negative lookbehind, checking that the match is not preceded by a digit, and (?!\d) is a negative lookahead, checking that the match is not followed by a digit. Crucially, these constructions do not consume any of the string.
The various lookahead and lookbehind constructions are described in the Regular Expression Syntax section of Python's re documentation.
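As a rough check that lookarounds are zero-width, the sketch below prints the spans of the lookaround matches. It also tries a smaller change to the pattern from the question, turning only the trailing group into a lookahead. That variant is my own illustration, not something the question used.

```python
import re

text = u'11a2ee22b333c44d5e66e777e8888'

# Each span covers only the digits, so the separator stays available for the next match.
lookaround = re.compile(r'(?<!\d)\d{2,3}(?!\d)')
spans = [(m.group(0), m.span()) for m in lookaround.finditer(text)]
print(spans)

# Assumed variant: the leading group still consumes one character,
# but the trailing check is a lookahead, so nothing is eaten after the digits.
half_fixed = re.findall(r'(?:\D|^)(\d{2,3})(?=\D|$)', text)
print(half_fixed)
```

Both approaches give the desired list for this input; the pure lookaround version is simpler because the whole match is just the digits.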
Upvotes: 2