regex - matching number with optional range

Question

Using Python's re module, I'm trying to get the dollar values from statements such as:

"$305,000 - $349,950" should give a tuple like this (305000, 349950)
"Mid $2M's Buyers" --> (2000000)
"... Buyers Guide $1.29M+" --> (1290000)
"...$485,000 and $510,000" --> (485000, 510000)

The pattern below works for single values but if there are ranges (like in the first and last dot point above) it only gives me the last number (i.e. 349950 and 510000).

_pattern = r"""(?x)
    ^
    .*
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplier1>[kKmM]?\s?[mM]?)
    )
    (?:\s(?:\-|\band\b|\bto\b)\s)?
    (?P<target2>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplier2>[kKmM]?\s?[mM]?)
    )?
    .*?
    $
    """

When I try target2 = match.group("target2").strip(), target2 always appears to be None.

I'm by no means a regexpert, but I can't really see what I'm doing wrong here. The multiplier group works, and to me it seems that the target2 group follows the same pattern, i.e. an optional match at the end.

I hope I'm phrasing this somewhat understandably...

RootTwo · Accepted Answer

+1 for using verbose mode for the regex pattern

The .* at the beginning of the pattern is greedy, so it first tries to match the entire line, then backtracks until target1 can match. Everything after target1 in the pattern is optional, so a match in which target1 lands on the last amount on the line already counts as a successful match. You can try making the first .* non-greedy by adding a '?' like so:

_pattern = r"""(?x)
    ^
    .*?                   # <-- add the ?
    (?P<target1>
    ... snip ...
    """

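The difference between the greedy and non-greedy prefix can be seen on a cut-down pattern. This is only a sketch: AMOUNT is a simplified stand-in assumed for illustration, not the full pattern from the question.

```python
import re

line = "$305,000 - $349,950"

# Cut-down amount pattern, assumed for illustration only.
AMOUNT = r"[$]\d{1,3}(?:,\d{3})*"

# Greedy prefix: .* swallows the line, then backtracks to the LAST amount.
greedy = re.match(r"^.*(" + AMOUNT + r")", line)
# Non-greedy prefix: .*? expands only as far as needed, so the FIRST amount wins.
lazy = re.match(r"^.*?(" + AMOUNT + r")", line)

greedy_hit = greedy.group(1)
lazy_hit = lazy.group(1)

print(greedy_hit)  # $349,950
print(lazy_hit)    # $305,000
```

In the full pattern it seems worth checking one more thing against real input: the \s? inside the multiplier group may consume the space before the separator, in which case the optional separator and target2 groups can still be skipped.
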
Alternatively, can you do it incrementally? Match the first amount, then search again from where that match ended:

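import re

_pattern = re.compile(r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplier1>[kKmM]?\s?[mM]?)
    )
    # leading \s? is optional: the multiplier group may already have consumed the space
    (?P<more>\s?(?:\-|\band\b|\bto\b)\s)?
    """)

match = _pattern.search(line)
# groups() would also return the nested multiplier group, so pick the two by name
target1, more = match.group("target1", "more")
if more:
    # re.search() has no start argument; a compiled pattern's search() takes a start position
    target2 = _pattern.search(line, match.end())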
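
For reference, here is a self-contained sketch of the same incremental idea that can be run as-is. AMOUNT, SEPARATOR and find_range are hypothetical names, and both patterns are simplified assumptions rather than the pattern from the question. It relies on the optional start position accepted by the search() and match() methods of a compiled pattern.

```python
import re

# Simplified amount pattern (an assumption, not the full pattern from the post).
AMOUNT = re.compile(r"[€$£]\d{1,3}(?:,\d{3})*(?:\.\d+)?[kKmM]?")
# Separator between the two ends of a range; surrounding whitespace is optional.
SEPARATOR = re.compile(r"\s*(?:-|and\b|to\b)\s*")


def find_range(line):
    """Return (first, second); second is None when no range follows."""
    first = AMOUNT.search(line)
    if first is None:
        return None, None
    # match() with a start position anchors the attempt right after the first amount.
    sep = SEPARATOR.match(line, first.end())
    if sep is None:
        return first.group(), None
    second = AMOUNT.match(line, sep.end())
    return first.group(), second.group() if second else None


print(find_range("$305,000 - $349,950"))
print(find_range("Mid $2M's Buyers"))
```

Keeping the separator as its own pattern avoids the question of which optional group consumes the space between the amount and the separator.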

Edit: One more thought: try re.findall():

_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplier1>[kKmM]?\s?[mM]?)
    )
"""

targets = re.findall(_pattern, line)
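
Because the pattern contains two groups, re.findall() returns a list of (target, multiplier) tuples rather than plain strings. A rough sketch of turning those into the integer tuples from the question might look like this. The group names, SCALE and dollar_values are hypothetical, and the conversion assumes that ',' is a thousands separator and '.' a decimal point.

```python
import re
from decimal import Decimal

_pattern = r"""(?x)
    (?P<target>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplier>[kKmM]?\s?[mM]?)
    )
"""

# Assumed meaning of the suffix letters: k = thousand, m = million.
SCALE = {"": 1, "k": 1_000, "m": 1_000_000}


def dollar_values(line):
    values = []
    for target, multiplier in re.findall(_pattern, line):
        # target ends with the multiplier text, so cut it off along with the currency sign.
        digits = target[1:len(target) - len(multiplier)].replace(",", "")
        # Only the first suffix letter is used; the captured text may include a space.
        suffix = multiplier.strip().lower()[:1]
        # Decimal avoids float rounding for inputs such as 1.29M.
        values.append(int(Decimal(digits) * SCALE[suffix]))
    return tuple(values)


for text in ("$305,000 - $349,950", "Mid $2M's Buyers",
             "... Buyers Guide $1.29M+", "...$485,000 and $510,000"):
    print(dollar_values(text))
```

Note that a single value comes back as a one-element tuple such as (2000000,).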
