Reputation: 55
I'm trying to write a regex that will find currency values in my text. The values range from 2 dollars to 2,240,000,000. I can't get a pattern that finds all of them. I tried something like:
^\{USD}?(\d*(\d\.?|\.\d{1,2}))$
but it didn't work. I'd appreciate any help :)
EDIT: For clarification, I have a text with several dollar values in it, ranging from 2 to 2,000,000,000.
The text is something like:
"The base purchase is USD 2,00. (...) The amount equal to US 2,300,000 which refers to the premium package. (...) The country needs USD 300,00..."
I want to find and extract these values (USD + numbers) and save them to a list, each value as a separate element. Thank you
Upvotes: 1
Views: 260
Reputation: 10698
Several things are wrong in your expression:
^\{USD}?(\d*(\d\.?|\.\d{1,2}))$
\{USD}? : in regex syntax this means: expect the literal character {, followed by USD, optionally followed by the character }. If you want an optional group USD, you have to use parentheses without the backslash: (USD)?. You can use a non-capturing group for this: (?:USD)?.
This would give: ^(USD)?(\d*(\d\.?|\.\d{1,2}))$
(\d\.?|\.\d{1,2}) : the whole group should be repeated in order to match the entire string: (\d\.?|\.\d{1,2})*
This would give: ^(USD)?(\d*(\d\.?|\.\d{1,2})*)$
\d\.? : if this is supposed to match the part with a thousands separator, it should use a comma rather than a period, judging by your example: \d,?
This would give: ^(USD)?(\d*(\d,?|\.\d{1,2})*)$
(\d*(\d : the leading \d* is redundant. Being greedy, it consumes every digit first, so the second \d can only match after the engine backtracks. You could use the non-greedy operator ? like this: (\d*?(\d but it's not pretty. This would give: ^(USD)?(\d*?(\d,?|\.\d{1,2})*)$
which may work for you, but looks less than optimal.
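As a quick check of the points above, here is a minimal Python sketch. The names BRACES, STEPWISE and matches are made up for this illustration; the two patterns are copied from the text. It shows what \{USD}? really accepts, and that the last intermediate pattern accepts more than it should, which is one reason to rebuild it from simpler parts.

```python
import re

# The prefix from the original attempt: "\{" is a literal "{" and "}?" an optional "}".
BRACES = re.compile(r"\{USD}?")
# The last intermediate pattern from the walkthrough, copied verbatim.
STEPWISE = re.compile(r"^(USD)?(\d*?(\d,?|\.\d{1,2})*)$")


def matches(pattern, value):
    """Return True when the whole of `value` matches `pattern` (hypothetical helper)."""
    return pattern.fullmatch(value) is not None


print(matches(BRACES, "{USD}"))             # True: literal braces are expected
print(matches(BRACES, "USD"))               # False: the "{" is mandatory
print(matches(STEPWISE, "USD2,300,000"))    # True
print(matches(STEPWISE, "USD 2,300,000"))   # False: no space is allowed after USD
print(matches(STEPWISE, "1,2,3,4"))         # True: a comma is allowed after any digit
print(matches(STEPWISE, ""))                # True: every part is optional
```

Because of the ^ and $ anchors, this pattern only validates a string that consists of a single value; it does not find values inside a longer text.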
An alternative would be to build your regular expression without an "or" clause using the following parts :
(USD ?)?
\d+
(,\d+)*
(\.\d+)?
Which would give something like this: (USD ?)?(\d+)(,\d+)*(\.\d+)?
You can test it on regex101.com
You can further restrict the number of digits in each part to avoid false positives:
(USD ?)?(\d{1,3})(,\d{3})*(\.\d{1,2})?
A final version would use non-capturing groups wherever capturing isn't needed:
(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:\.\d{1,2})?
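To make the role of each part concrete, here is a small Python sketch that assembles the same pattern from commented pieces. PARTS, CURRENCY and is_currency are illustrative names, not part of any library, and fullmatch is used here only to test single values.

```python
import re

# The parts listed above, in order; joined they form the final pattern.
PARTS = [
    r"(?:USD ?)?",      # optional "USD" prefix, optionally followed by one space
    r"(?:\d{1,3})",     # leading group of one to three digits
    r"(?:,\d{3})*",     # zero or more thousands groups: a comma and three digits
    r"(?:\.\d{1,2})?",  # optional decimal part with one or two digits
]
CURRENCY = re.compile("".join(PARTS))


def is_currency(value):
    """Return True when the whole string is one currency value (hypothetical helper)."""
    return CURRENCY.fullmatch(value) is not None


print(is_currency("USD 2,240,000,000"))  # True
print(is_currency("USD2.50"))            # True
print(is_currency("1,2,3,4"))            # False: thousands groups need three digits
print(is_currency("1234"))               # False: more than three digits need a comma
```

Note the last case: with the digit counts restricted, a number written without thousands separators is only accepted up to three digits.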
Edit: the test case you provided uses decimal separators inconsistently (sometimes ".", sometimes ","). If you really want to match that, you can use a character class like this:
(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:[.,]\d{1,2})?
This matches every number in your example.
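For the extraction into a list that the question asks for, a minimal Python sketch could look like this. TEXT is the sample from the question; AMOUNT and find_amounts are names chosen for this example. Since the pattern contains only non-capturing groups, re.findall returns each whole match as one string.

```python
import re

# Sample text taken from the question.
TEXT = (
    "The base purchase is USD 2,00. (...) The amount equal to US 2,300,000 "
    "which refers to the premium package. (...) The country needs USD 300,00..."
)

# The character-class variant of the pattern, copied from the line above.
AMOUNT = re.compile(r"(?:USD ?)?(?:\d{1,3})(?:,\d{3})*(?:[.,]\d{1,2})?")


def find_amounts(text):
    """Return every currency-like value found in `text`, in order of appearance."""
    return AMOUNT.findall(text)


print(find_amounts(TEXT))
# ['USD 2,00', '2,300,000', 'USD 300,00']
```

The second value comes back without a prefix because the text says "US", not "USD", and the pattern only recognises the latter.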
Upvotes: 3
Reputation: 586
Ok, let's start with
import re
text = "The base purchase is USD 2,00.00 (...) The amount equal to US 2,300,000 which refers to the premium package. (...) The country needs USD 300,00..."
As @zakinster proposed, you can find the numeric strings you're interested in with:
regex = r"(?:USD)?(?:\d+,)*\d+(?:\.\d+)?"
numbers = re.findall(regex, text)
Then, to keep only the ones in the range you mentioned:
def toInteger(s):
    # Drop the decimal part and the thousands separators, then convert.
    return int(s.split('.')[0].replace(',', ''))

def numberBetween(string, lowerBound, upperBound):
    intValue = toInteger(string)
    # Chained comparison; the bitwise `&` binds tighter than `>` and `<`, so it must not be used here.
    return lowerBound < intValue < upperBound

print(list(filter(lambda x: numberBetween(x, 2, 2240000000), numbers)))
should give you what you want :
['2,00.00', '2,300,000', '300,00']
Upvotes: 0