Reputation: 149
I am trying to use a Python regex to extract only the validly formatted numbers from a string that also contains many invalidly formatted ones. A valid number uses commas as thousands separators and may have a decimal part. Everything else is invalid. A sample is below.
Sample input string:
input = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"
Expected output: 1,000,000.00 100,000 1,000,000
The current python regex I tried is as follows:
\d{1,3}(,{1}\d{3})*(\.{1}\d+){0,1}$
This only works when the input is just numbers. When I try to input numbers with words around them it fails.
Upvotes: 2
Views: 165
Reputation: 163227
Another option is to use a negative lookahead to rule out values that have only zeroes before the first comma, and to require at least one comma group after the leading digits, since your desired output is 1,000,000.00 100,000 1,000,000
(?<!\S)(?!0+\,)\d{1,3}(?:,\d{3})+(?:\.\d+)?(?!\S)
Explanation
(?<!\S)
Assert a whitespace boundary to the left
(?!0+\,)
Assert not only zeroes before the first comma
\d{1,3}
Match 1-3 digits
(?:,\d{3})+
Repeat 1+ times matching a comma and 3 digits
(?:\.\d+)?
Optionally match a dot and 1+ digits
(?!\S)
Assert a whitespace boundary to the right
Example
import re
input = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"
regex = r"(?<!\S)(?!0+\,)\d{1,3}(?:,\d{3})+(?:\.\d+)?(?!\S)"
print(re.findall(regex, input))
Output
['1,000,000.00', '100,000', '1,000,000']
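If it helps readability, the same pattern can be spread over several lines with re.VERBOSE so each piece carries its explanation. This is only a sketch: the names NUMBER and text are illustrative, and the sample string is a shortened version of the one in the question.

```python
import re

# Illustrative names; the pattern is the one above, split across lines.
NUMBER = re.compile(r"""
    (?<!\S)        # whitespace (or start of string) to the left
    (?!0+,)        # not only zeroes before the first comma
    \d{1,3}        # 1-3 leading digits
    (?:,\d{3})+    # one or more groups of a comma and exactly 3 digits
    (?:\.\d+)?     # optional decimal part
    (?!\S)         # whitespace (or end of string) to the right
""", re.VERBOSE)

text = "ok 1,000,000.00 and 100,000 but not 1 000,000.00 or 1,000,000,00"
print(NUMBER.findall(text))
```

Because every group in the pattern is non-capturing, findall returns the full matches rather than group contents.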
Upvotes: 2
Reputation: 520998
The following regex pattern gets closer to what you want here:
(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)
This uses lookarounds to assert that boundaries for the numbers must be either whitespace or the start/end of the input. Also note that we insist that each valid number not start with zero.
I would use re.findall as follows:
import re
inp = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"
matches = re.findall(r'(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)', inp)
print(matches)
This prints:
['1,000,000.00', '100,000', '1,000,000', '1']
As a note on why 1 appears as a result above, there is no obvious way to know that the standalone number 1 is actually part of the broken one million number.
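If the standalone 1 is unwanted, one possible tweak is to require at least one comma group by changing the * quantifier to +. This is a sketch that assumes every valid amount contains a thousands separator, which the question does not state outright; the name strict is just illustrative.

```python
import re

inp = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"

# Assumption: a valid amount always has at least one ",ddd" group,
# so the group quantifier is + instead of * and a bare 1 no longer qualifies.
strict = re.findall(r'(?<!\S)[1-9]\d{0,2}(?:,\d{3})+(?:\.\d+)?(?!\S)', inp)
print(strict)
```

The trade-off is that plain numbers below one thousand, such as 999 or 12.50, are rejected as well.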
Upvotes: 4