silvercoder

Reputation: 149

How to write regex to capture specific number formats and exclude the rest?

I am trying to extract only the validly formatted numbers from a string that also contains many invalid ones, using Python regex. A number is valid if it uses commas as thousands separators, optionally followed by a decimal part. Everything else is invalid. A sample is below.

Sample input string:

input = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"

Expected output: 1,000,000.00 100,000 1,000,000

The current python regex I tried is as follows:

\d{1,3}(,{1}\d{3})*(\.{1}\d+){0,1}$

This only works when the input consists of a single number. When the numbers are surrounded by words, it fails.
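One likely cause, assuming the pattern is applied with re.search or re.findall, is the trailing `$`: it only allows a match that ends at the very end of the string, so a number followed by words can never match. A minimal sketch of that behaviour (the variable names are illustrative, not from the original post):

```python
import re

# The pattern from the question, including its trailing "$" anchor.
anchored = r"\d{1,3}(,{1}\d{3})*(\.{1}\d+){0,1}$"

# A lone number ends where the string ends, so "$" is satisfied.
alone = re.search(anchored, "1,000,000.00")

# Here the number is followed by words, so "$" cannot be reached
# right after the digits and the search finds nothing.
embedded = re.search(anchored, "The net value is 1,000,000.00 however")

print(alone.group(0) if alone else None)
print(embedded)
```

This should print 1,000,000.00 and then None. The pattern also uses capturing groups, so re.findall would return the group contents rather than the whole number; non-capturing groups `(?:...)` avoid that.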

Upvotes: 2

Views: 165

Answers (2)

The fourth bird

Reputation: 163227

Another option is to use a negative lookahead to rule out values that have only zeroes before the first comma, and to require at least one comma group, since your desired output is 1,000,000.00 100,000 1,000,000

(?<!\S)(?!0+\,)\d{1,3}(?:,\d{3})+(?:\.\d+)?(?!\S)

Explanation

  • (?<!\S) Assert a whitespace boundary to the left
  • (?!0+\,) Assert not only zeroes before the first comma
  • \d{1,3} Match 1-3 digits
  • (?:,\d{3})+ Repeat 1+ times matching a comma and exactly 3 digits
  • (?:\.\d+)? Optionally match a dot and 1+ digits
  • (?!\S) Assert a whitespace boundary at the right
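
The parts listed above can be checked one at a time. The following is only a sketch: the candidate strings are hypothetical test values, each chosen to exercise one rule, and fullmatch is used so that every candidate is tested as a whole token.

```python
import re

# Pattern from this answer, unchanged.
pattern = re.compile(r"(?<!\S)(?!0+\,)\d{1,3}(?:,\d{3})+(?:\.\d+)?(?!\S)")

# Hypothetical candidates (not from the question), one per rule:
candidates = [
    "10,000",     # valid: leading group is not all zeroes
    "1,000.50",   # valid: optional decimal part
    "0,000",      # rejected by (?!0+\,): only zeroes before the first comma
    "100",        # rejected: at least one ",ddd" group is required
    "1,000,00",   # rejected: last group has only 2 digits
    "1.000.000",  # rejected: dots used as group separators
]

accepted = [c for c in candidates if pattern.fullmatch(c)]
print(accepted)
```

This should print ['10,000', '1,000.50'].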

Example

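import re

# Named "text" rather than "input" to avoid shadowing the built-in input()
text = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"
regex = r"(?<!\S)(?!0+\,)\d{1,3}(?:,\d{3})+(?:\.\d+)?(?!\S)"

# The pattern has no capturing groups, so findall returns the full matches
print(re.findall(regex, text))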

Output

['1,000,000.00', '100,000', '1,000,000']

Upvotes: 2

Tim Biegeleisen

Reputation: 520998

The following regex pattern gets closer to what you want here:

(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)

This uses lookarounds to assert that boundaries for the numbers must be either whitespace or the start/end of the input. Also note that we insist that each valid number not start with zero.
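
To illustrate why whitespace lookarounds are used rather than `\b`, here is a small sketch. The `\b` variant is an assumption introduced only for comparison; it is not part of the answer.

```python
import re

sample = "1.000.000.00"  # one of the invalid formats from the question

# Whitespace lookarounds, as in the pattern above.
with_lookarounds = r"(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)"
# Same body with \b word boundaries instead (comparison only).
with_word_boundaries = r"\b[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?\b"

lookaround_matches = re.findall(with_lookarounds, sample)
word_boundary_matches = re.findall(with_word_boundaries, sample)

print(lookaround_matches)
print(word_boundary_matches)
```

This should print [] and then ['1.000']. A word boundary exists between a digit and a dot, so `\b` lets the pattern pick a fragment out of the invalid number, while the lookarounds require the whole whitespace-delimited token to match.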

I would use re.findall as follows:

import re

inp = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"
matches = re.findall(r'(?<!\S)[1-9]\d{0,2}(?:,\d{3})*(?:\.\d+)?(?!\S)', inp)
print(matches)

This prints:

['1,000,000.00', '100,000', '1,000,000', '1']

As for why 1 appears in the result above: there is no obvious way to know that the standalone number 1 is actually part of the broken one-million number (1 000,000.00).
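
If plain numbers without any comma should never count as valid, one possible workaround is to require at least one comma group by changing `*` to `+`. This is a sketch under that assumption, not a general fix: it also drops legitimate small numbers such as 100.

```python
import re

inp = "The net value is 1,000,000.00 however even 100,000 or 1,000,000 is acceptable. The amounts that are not acceptable are 1 000,000.00 or 1.000.000.00 or 1,000,000,00 or 1,000,000,0000"

# Same pattern as above, except "*" becomes "+" so that at least one
# ",ddd" group is required and the lone "1" no longer qualifies.
at_least_one_comma = r"(?<!\S)[1-9]\d{0,2}(?:,\d{3})+(?:\.\d+)?(?!\S)"

matches = re.findall(at_least_one_comma, inp)
print(matches)
```

This should print ['1,000,000.00', '100,000', '1,000,000'], which is the output asked for in the question.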

Upvotes: 4
