Reputation: 4979
Using Python's re module, I'm trying to get the dollar values from statements such as:
"$305,000 - $349,950"
"Mid $2M's Buyers"
"... Buyers Guide $1.29M+"
"...$485,000 and $510,000"
The pattern below works for single values but if there are ranges (like in the first and last dot point above) it only gives me the last number (i.e. 349950 and 510000).
_pattern = r"""(?x)
    ^
    .*
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
    (?:\s(?:\-|\band\b|\bto\b)\s)?
    (?P<target2>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer2>[kKmM]?\s?[mM]?)
    )?
    .*?
    $
"""
When trying target2 = match.group("target2").strip(), target2 always appears to be None.
I'm by no means a regexpert but can't really see what I'm doing wrong here. The multiplyer group works, and to me it seems that the target2 group follows the same pattern, i.e. an optional match at the end.
I hope I'm phrasing this somewhat understandably...
Upvotes: 3
Views: 860
Reputation: 43169
You could combine some regex logic with a function that converts the abbreviated numbers. Here's some example Python code:
# -*- coding: utf-8 -*-
import locale
import re
from locale import atof

# atof() reads "," as the thousands separator under this locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
string = """"$305,000 - $349,950"
"Mid $2M's Buyers"
"... Buyers Guide $1.29M+"
"...$485,000 and $510,000"
"""

def convert_number(number, unit):
    # only called with "K" or "M", the two units the regex can capture
    if unit == "K":
        exp = 10**3
    elif unit == "M":
        exp = 10**6
    return (atof(number) * exp)

matches = []
rx = r"""
    \$(?P<value>\d+[\d,.]*)       # match a dollar sign
                                  # followed by numbers, dots and commas
                                  # make the first digit necessary (+)
    (?P<unit>M|K)?                # match M or K and save it to a group
    (                             # opening parenthesis
        \s(?:-|and)\s             # match whitespace, a dash or "and", whitespace
        \$(?P<value1>\d+[\d,.]*)  # the same pattern as above
        (?P<unit1>M|K)?
    )?                            # closing parenthesis,
                                  # make the whole subpattern optional (?)
"""
for match in re.finditer(rx, string, re.VERBOSE):
    if match.group('unit') is not None:
        value1 = convert_number(match.group('value'), match.group('unit'))
    else:
        value1 = atof(match.group('value'))
    m = (value1)
    if match.group('value1') is not None:
        if match.group('unit1') is not None:
            value2 = convert_number(match.group('value1'), match.group('unit1'))
        else:
            value2 = atof(match.group('value1'))
        m = (value1, value2)
    matches.append(m)
print(matches)
# [(305000.0, 349950.0), 2000000.0, 1290000.0, (485000.0, 510000.0)]
The code uses quite some logic: it first imports the locale module for the atof() function, defines a function convert_number(), and searches for ranges with a regular expression that is explained in the code comments. You could obviously add other currency symbols like €$£ but they weren't in your original examples.
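If the en_US.UTF-8 locale is not installed on your machine, setlocale() will raise an error, so here is a rough locale-free sketch of the conversion step. The names parse_amount and UNIT_FACTORS are made up for this example, and it assumes commas are thousands separators and "." is the decimal point, as in the examples above.

```python
# Hypothetical helper, not part of the answer's code: converts the captured
# number and unit strings without relying on the locale module.
UNIT_FACTORS = {"K": 10**3, "M": 10**6}


def parse_amount(number, unit=None):
    """Convert e.g. ("1.29", "M") or ("305,000", None) to a float.

    Assumption: "," is a thousands separator and "." the decimal point.
    """
    value = float(number.replace(",", ""))
    # A missing or unknown unit leaves the value unscaled.
    return value * UNIT_FACTORS.get(unit, 1)


print(parse_amount("305,000"))   # 305000.0
print(parse_amount("2", "M"))    # 2000000.0
```

Looking the factor up in a dict also means an unexpected unit does not leave a variable unassigned, which the if/elif version would do for anything other than K or M.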
Upvotes: 3
Reputation: 4418
+1 for using verbose mode for the regex pattern
The .* at the beginning of the pattern is greedy, so it first tries to match the entire line. Then it backtracks until target1 can match. Everything after target1 in the pattern is optional, so matching target1 against the last price on the line already counts as a successful match. You can try making the first .* not greedy by adding a '?' like so:
_pattern = r"""(?x)
    ^
    .*?    # <-- add the ?
    (?P<target1>
    ... snip ...
"""
Can you do it incrementally?
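import re

_pattern = re.compile(r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        # only take whitespace when an m/M follows it, so the space
        # before the separator is left for the "more" group
        (?P<multiplyer1>[kKmM]?(?:\s?[mM])?)
    )
    (?P<more>\s(?:\-|\band\b|\bto\b)\s)?
""")
match = _pattern.search(line)
target1, more = match.group("target1", "more")
if more:
    # a compiled pattern's search() takes a start position; re.search() does not
    target2 = _pattern.search(line, match.end())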
Edit: One more thought: try re.findall():
_pattern = r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
"""
targets = re.findall(_pattern, line)
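Because the pattern contains two groups, re.findall() returns a list of tuples (target1, multiplyer1) and not plain strings. If you only want the price text, a variant with re.finditer() might be easier to work with. The sketch below reuses the pattern from this answer; the names PRICE and find_prices are just made up for the example.

```python
import re

# Same number pattern as in the answer above, compiled once.
PRICE = re.compile(r"""(?x)
    (?P<target1>
        [€$£]
        \d{1,3}
        [,.]?
        \d{0,3}
        (?:[,.]\d{3})*
        (?P<multiplyer1>[kKmM]?\s?[mM]?)
    )
""")


def find_prices(line):
    """Return every price-like substring in line, left to right."""
    # group("target1") picks the whole price; strip() drops the space that
    # the optional whitespace in the multiplier group may have consumed.
    return [m.group("target1").strip() for m in PRICE.finditer(line)]


print(find_prices("...$485,000 and $510,000"))  # ['$485,000', '$510,000']
print(find_prices("Mid $2M's Buyers"))          # ['$2M']
```

A line with one price gives a one-element list and a range gives two, so the caller can tell them apart by length.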
Upvotes: 2