Reputation: 43
Goal: Using regex (not split), I would like to take a string of numbers and return only the "properly formatted" ones. By "properly formatted" I mean that the digits are grouped in threes, with a comma before each group of three.
My code:
import re
numRegex = re.compile(r'\b\d{1,3}(?:,\d{3})*\b')
print(numRegex.findall('42 1,234 6,368,745 12,34,567 1234'))
When I run the code I would expect to get:
['42', '1,234', '6,368,745']
Instead I get back:
['42', '1,234', '6,368,745', '12', '34,567']
I would guess it's treating the comma (,) as a boundary (\b), but I'm not sure how to get around this elegantly.
FYI: This example is an adaptation of the problem question from "Automate the Boring Stuff with Python: Practical Programming for Total Beginners". The example problem only requested a regex to figure out if an individual number is formatted correctly and didn't expect you to parse out all "properly formatted" numbers from a long string of multiple numbers. I misinterpreted the question initially and now I'm on a mission to finish it out this way.
Upvotes: 0
Views: 30
Reputation: 37237
Try negative lookarounds:
numRegex = re.compile(r'\b\d{1,3}(?:,\d{3})*\b(?!,)')
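The lookahead assertion (?!,) ensures that the boundary on the right side cannot be followed by a comma. On the sample string this rejects '12', but '34,567' is still matched because nothing checks the comma in front of it.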
Similarly you can have lookbehind assertions that require the matched text to not be preceded by a comma:
numRegex = re.compile(r'(?<!,)\b\d{1,3}(?:,\d{3})*\b(?!,)')
This way, when a "number" has a comma on either side, it will not be matched.
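To see what each assertion contributes, here is a small sketch that runs the three patterns against the sample string from the question. The variable names (text, original, lookahead_only, both) are just mine for illustration.

```python
import re

# Sample string from the question.
text = '42 1,234 6,368,745 12,34,567 1234'

# Pattern from the question: \b also matches between a digit and a comma,
# so pieces of '12,34,567' are picked up.
original = re.compile(r'\b\d{1,3}(?:,\d{3})*\b')

# Lookahead only: a match may not be followed by a comma.
lookahead_only = re.compile(r'\b\d{1,3}(?:,\d{3})*\b(?!,)')

# Lookbehind and lookahead: no comma directly before or after the match.
both = re.compile(r'(?<!,)\b\d{1,3}(?:,\d{3})*\b(?!,)')

print(original.findall(text))        # ['42', '1,234', '6,368,745', '12', '34,567']
print(lookahead_only.findall(text))  # ['42', '1,234', '6,368,745', '34,567']
print(both.findall(text))            # ['42', '1,234', '6,368,745']
```

Note that '1234' is never matched by any of the three: after \d{1,3} consumes at most three digits, the \b cannot succeed in the middle of a run of digits.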
Upvotes: 1