Vanj

Reputation: 55

How to write a regex in Python to find values between 2 and 2,000,000,000?

I'm trying to write a regex that will find currency values in my text. The values range from 2 dollars to 2,240,000,000. I can't come up with an expression that finds all of them. I tried something like:

^\{USD}?(\d*(\d\.?|\.\d{1,2}))$

but it didn't work. I'd appreciate any help :)

EDIT: For clarification, I have a text with several dollar values in it, ranging from 2 to 2,000,000,000.

The text is something like:

"The base purchase is USD 2,00. (...) The amount equal to US 2,300,000 which refers to the premium package. (...) The country needs USD 300,00..."

I want to find and extract these values (USD + numbers) and save them to a list, each value as a separate element. Thank you

Upvotes: 1

Views: 260

Answers (2)

zakinster

Reputation: 10698

Several things are wrong in your expression: ^\{USD}?(\d*(\d\.?|\.\d{1,2}))$

  1. \{USD}? in regex syntax means: expect a literal { followed by USD, optionally followed by a literal }. If you want an optional USD group, use parentheses without the backslash: (USD)?. You can also use a non-capturing group for this: (?:USD)?.

This would give: ^(USD)?(\d*(\d\.?|\.\d{1,2}))$

  2. (\d\.?|\.\d{1,2}): the whole group should be repeated in order to match the entire string: (\d\.?|\.\d{1,2})*

This would give: ^(USD)?(\d*(\d\.?|\.\d{1,2})*)$

  3. \d\.?: if this is supposed to match a digit followed by a thousands separator, the separator should be a comma rather than a point, judging by your example: \d,?

This would give: ^(USD)?(\d*(\d,?|\.\d{1,2})*)$

  4. (\d*(\d: this is awkward. The first \d* greedily consumes every digit, so the second \d can only match after backtracking. You could use the non-greedy operator ? like this: (\d*?(\d but it's not pretty.

This would give: ^(USD)?(\d*?(\d,?|\.\d{1,2})*)$ which may work for you, but looks less than optimal.
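To see the difference between the pattern from the question and the patched one, here is a minimal sketch using Python's re module. The test strings are made up for illustration. It also shows something the steps above do not spell out: because of the ^ and $ anchors, the patched pattern only matches when the string contains nothing but the amount.

```python
import re

# Pattern from the question: \{ is a literal brace, and }? makes only the closing brace optional.
original = re.compile(r"^\{USD}?(\d*(\d\.?|\.\d{1,2}))$")
# Pattern after the four fixes listed above.
patched = re.compile(r"^(USD)?(\d*?(\d,?|\.\d{1,2})*)$")

brace_hit = original.search("{USD}12.50")   # matches: the braces are taken literally
plain_miss = original.search("USD12.50")    # None: there is no leading {
whole_hit = patched.search("USD2,300,000")  # matches: the string is only the amount
inline_miss = patched.search("It costs USD2,300,000 today")  # None: ^ and $ span the whole string

print(bool(brace_hit), plain_miss, bool(whole_hit), inline_miss)
```

For extracting amounts from running text, the anchors would have to go, which the alternative pattern does.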

An alternative would be to build your regular expression without an "or" clause, using the following parts:

  1. The prefix "USD ", optional and with an optional space: (USD ?)?
  2. The integral part of the amount before the thousands separators: \d+
  3. The integral part of the amount after a thousands separator, optional and repeatable: (,\d+)*
  4. The decimal part, optional: (\.\d+)?

Which would give something like this: (USD ?)?(\d+)(,\d+)*(\.\d+)?

You can test it on regex101.com
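The four parts can also be assembled and tried out in Python. This is only a sketch: the variable names and the sample sentence are made up for illustration. Note that because the parts use capturing groups, re.findall would return tuples of groups, so the sketch uses finditer and group(0) to get the full match.

```python
import re

# The four building blocks described above (variable names are illustrative).
prefix = r"(USD ?)?"      # optional "USD" with an optional space
integral = r"(\d+)"       # digits before the first thousands separator
thousands = r"(,\d+)*"    # repeated ",ddd" groups
decimal = r"(\.\d+)?"     # optional decimal part

pattern = re.compile(prefix + integral + thousands + decimal)

sample = "The amount equal to USD 2,300,000.50 in total"  # made-up sentence
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)  # ['USD 2,300,000.50']
```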

You can further restrict the number of digits in each part to avoid false positives:

(USD ?)?(\d{1,3})(,\d{3})*(\.\d{1,2})?

A final version would use non-capturing groups wherever capturing is not needed:

(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:\.\d{1,2})?
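As a rough illustration of what the digit restrictions buy you, the sketch below compares a loose pattern (unrestricted digit counts) with the restricted one on a made-up sentence. With only non-capturing groups, re.findall returns the matched strings directly.

```python
import re

# Loose variant: any number of digits in each part.
loose = re.compile(r"(?:USD ?)?\d+(?:,\d+)*(?:\.\d+)?")
# Restricted variant from above: 1-3 leading digits, groups of exactly 3, 1-2 decimals.
strict = re.compile(r"(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:\.\d{1,2})?")

sample = "Items 1,2,3 cost USD 2,300,000.50"  # made-up sentence

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # ['1,2,3', 'USD 2,300,000.50']
print(strict_hits)  # ['1', '2', '3', 'USD 2,300,000.50']
```

The loose pattern reads the enumeration "1,2,3" as a single number, while the restricted one treats it as three separate values.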

Edit: the test case you provided uses decimal separators inconsistently (sometimes ".", sometimes ","). If you really want to match that, you can use a character class like this:

(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:[.,]\d{1,2})?

Which matches every number in your example.
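To collect the values into a list as the question asks, a minimal sketch with re.findall on the sample text from the question could look like this:

```python
import re

pattern = re.compile(r"(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:[.,]\d{1,2})?")

# Sample text copied from the question.
text = ("The base purchase is USD 2,00. (...) The amount equal to US 2,300,000 "
        "which refers to the premium package. (...) The country needs USD 300,00...")

# All groups are non-capturing, so findall returns the full matches as strings.
values = pattern.findall(text)
print(values)  # ['USD 2,00', '2,300,000', 'USD 300,00']
```

The second value comes back without a prefix because the text says "US" rather than "USD" at that point.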

Upvotes: 3

Etienne Herlaut

Reputation: 586

Ok, let's start with

import re
text = "The base purchase is USD 2,00.00 (...) The amount equal to US 2,300,000 which refers to the premium package. (...) The country needs USD 300,00..."

As @zakinster proposed, you can find the number strings that interest you with:

regex = r"(?:USD)?(?:\d+,)*\d+(?:\.\d+)?"
numbers = re.findall(regex, text)

Then, to filter the ones you've mentioned:

def toInteger(s):
    # Keep the part before the decimal point and drop the thousands separators
    return int(s.split('.')[0].replace(',', ''))

def numberBetween(string, lowerBound, upperBound):
    intValue = toInteger(string)
    # Chained comparison with inclusive bounds
    return lowerBound <= intValue <= upperBound

print(list(filter(lambda x: numberBetween(x,2,2240000000),numbers)))

This should give you what you want:

['2,00.00', '2,300,000', '300,00']
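One caveat: toInteger drops every comma, so '300,00' is read as 30000. If the comma in such values is actually a decimal mark, as the other answer's edit suggests, the conversion needs to tell the two cases apart. The sketch below is one possible approach, not part of the original answer: to_amount is a hypothetical helper that assumes a final "." or "," followed by one or two digits is the decimal part, and everything else is a thousands separator.

```python
import re
from decimal import Decimal

def to_amount(raw):
    """Hypothetical helper: parse 'USD 2,300,000' or 'USD 300,00' into a Decimal.

    Assumption: a trailing '.' or ',' followed by 1-2 digits is the decimal part.
    """
    digits = raw.replace("USD", "").strip()
    m = re.search(r"[.,](\d{1,2})$", digits)
    if m:
        whole, frac = digits[:m.start()], m.group(1)
    else:
        whole, frac = digits, "0"
    # Any separators left in the whole part are thousands separators.
    whole = whole.replace(",", "").replace(".", "")
    return Decimal(whole + "." + frac)

found = ["USD 2,00", "2,300,000", "USD 300,00"]  # strings as extracted from the question's text
amounts = [to_amount(s) for s in found]
in_range = [a for a in amounts if 2 <= a <= 2240000000]
print(amounts)
```

Under that assumption, "USD 2,00" is read as 2 and "USD 300,00" as 300, and all three values fall inside the requested range.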

Upvotes: 0
