Devi Prasad Khatua

Reputation: 1235

Match a specific number of digits not preceded or followed by digits

I have a string:

string = u'11a2ee22b333c44d5e66e777e8888'

I want to find every run of exactly k consecutive digits, where n <= k <= m.

I want to do this with a regular expression only. For example, with n=2 and m=3 I tried (?:\D|^)(\d{2,3})(?:\D|$):

re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')

Gives this output:

['11', '333', '66']

Desired output:

['11', '22', '333', '44', '66', '777']

I know there are alternate solutions like:

list(filter(lambda x: re.match(r'^\d{2,3}$', x), re.split(r'\D', u'11a2ee22b333c44d5e66e777e8888')))

which gives the desired output, but I want to know what is wrong with the first approach.

It seems re.findall scans left to right and never revisits characters that an earlier match has already used, so what can be done?

Upvotes: 1

Views: 660

Answers (3)

Jan

Reputation: 43169

You could even generalize it with a function:

import re

string = "11a2ee22b333c44d5e66e777e8888"

def numbers(text, n, m):
    # Runs of n to m digits with no digit immediately before or after.
    rx = re.compile(r'(?<!\d)(\d{' + '{},{}'.format(n, m) + '})(?!\d)')
    return rx.findall(text)

print(numbers(string, 2, 3))
# ['11', '22', '333', '44', '66', '777']

Upvotes: 1

Shenglin Chen

Reputation: 4554

Use a lookahead: \d{2,3} matches 2 or 3 digits, and (?=[a-z]) requires a lowercase letter right after the digits without consuming it.

In [136]: re.findall(r'(\d{2,3})(?=[a-z])',string)
Out[136]: ['11', '22', '333', '44', '66', '777']
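
One caveat, shown as a minimal sketch: the lookahead only constrains what comes after the digits, so this pattern works for the string in the question but not in general. The inputs '8888a' and 'a22' are made-up examples, not taken from the question, and the variable names are my own. The comparison pattern is the (?<!\d)\d{2,3}(?!\d) form, which checks both sides.

```python
import re

lookahead_only = re.compile(r'(\d{2,3})(?=[a-z])')
both_sides = re.compile(r'(?<!\d)\d{2,3}(?!\d)')

# Four digits followed by a letter: the lookahead-only pattern still finds
# three digits that happen to be followed by a letter.
long_run = lookahead_only.findall('8888a')

# A valid run at the very end of the string has no letter after it.
at_end = lookahead_only.findall('a22')

strict_long_run = both_sides.findall('8888a')
strict_at_end = both_sides.findall('a22')

print(long_run, at_end)                # ['888'] []
print(strict_long_run, strict_at_end)  # [] ['22']
```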

Upvotes: 1

schesis

Reputation: 59118

Note: The result you show in your question is not what I'm getting:

>>> import re
>>> re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'44', u'66']

It's still missing some of the matches you want, but not the same ones.

The problem is that even though non-capturing groups like (?:\D|^) and (?:\D|$) don't capture what they match, they still consume it.

This means that the match which yields '22' has actually consumed:

  1. e, with (?:\D|^) – not captured (but still consumed)
  2. 22 with (\d{2,3}) – captured
  3. b with (?:\D|$) – not captured (but still consumed)

… so that b is no longer available to be matched before 333.
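
If you want to see the consumption for yourself, here is a small sketch using re.finditer (the variable names are my own). group(0) is the whole match, including whatever the non-capturing groups matched, while group(1) is only the captured digits.

```python
import re

text = '11a2ee22b333c44d5e66e777e8888'
pattern = re.compile(r'(?:\D|^)(\d{2,3})(?:\D|$)')

# Collect the full match, the captured digits, and the span of the full match.
consumed = [(m.group(0), m.group(1), m.span()) for m in pattern.finditer(text)]

for whole, digits, span in consumed:
    print(repr(whole), repr(digits), span)
# '11a' '11' (0, 3)
# 'e22b' '22' (5, 9)
# 'c44d' '44' (12, 16)
# 'e66e' '66' (17, 21)
```

The match for 22 ends at index 9, so the next search starts at the first 3 of 333, with no non-digit left in front of it.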

You can get the result you want with lookbehind and lookahead syntax:

>>> re.findall(u'(?<!\d)\d{2,3}(?!\d)',u'11a2ee22b333c44d5e66e777e8888')
[u'11', u'22', u'333', u'44', u'66', u'777']

Here, (?<!\d) is a negative lookbehind, checking that the match is not preceded by a digit, and (?!\d) is a negative lookahead, checking that the match is not followed by a digit. Crucially, these constructions do not consume any of the string.
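
As a rough illustration that the lookarounds are zero-width, the same re.finditer check (again with my own variable names) shows that each match span covers only the digits:

```python
import re

text = '11a2ee22b333c44d5e66e777e8888'
pattern = re.compile(r'(?<!\d)\d{2,3}(?!\d)')

# The lookarounds contribute nothing to the span of each match.
spans = [(m.group(0), m.span()) for m in pattern.finditer(text)]

for digits, span in spans:
    print(digits, span)
# 11 (0, 2)
# 22 (6, 8)
# 333 (9, 12)
# 44 (13, 15)
# 66 (18, 20)
# 777 (21, 24)
```

The b at index 8 is in neither span, so it can serve as the "not a digit" context both after 22 and before 333.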

The various lookahead and lookbehind constructions are described in the Regular Expression Syntax section of Python's re documentation.

Upvotes: 2
