Reputation: 1424
I have a set of strings like:
[
"ERDF : EUR 2.7 million",
"ERDF : EUR 961 000",
"ERDF: 7 305 000 DKR (+/- EUR 974 000) ",
"FEOGA: 40 826 EUR",
"49 % of eligible costs",
"ERDF contribution: 64%",
"FEDER (Objectif 5b 1994-1996) 60 979 euros (400 000 FRF)",
"FEDER, Objectif 2, 1994 - 1999: 1 116 000 EUR",
"EUR 8.000.000",
"EUR 7.200.000",
"EUR 4.200.000",
"4.2 million euros",
"EUR 0.2 million",
"EUR 0.6 million",
"FEDER: 830 842 euros (5 450 000 FRF)",
"EUR 7,220,000,000",
"DKR 1 220 000 + DKR 1 380 000 ",
"GBP 150 000" ]
Online at regex101.com
I would like to capture the numbers (with 'million' if present) that have eur* either as a prefix or as a suffix. The cases below should all match the expression:
10 million euros
EURO 5.000
EUR 100
My current regular expression works only if eur* comes before the number:
/(\beur[a-z]*|€)+[\s\d\,\.|million]*\b/gi
Upvotes: 1
Views: 288
Reputation: 1091
You can also try this:
((?=[\w\s\d\.]+eur)|(?=[\w\s\d\.]+EUR))(eur(os)?|EUR(OS)?|million|\s|\d|\.)+?(?=$|\(|\))
Upvotes: 0
Reputation: 6125
Basic Answer
I'd recommend something like
/(?:eur\w*|€)?\s*([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)?/i
This recognizes the number and the "million" as two separate capture groups, and matches each of your given examples:
EUR 7.200.000 --> Group 1 = 7.200.000
euro 4 000 --> Group 1 = 4 000
EUR 0.2 million --> Group 1 = 0.2, Group 2 = million
Main project: 300 000 EUR --> Group 1 = 300 000
1998: 43.000.000 euros --> Group 1 = 43.000.000
Here's a live example at Regex101 of the regex that you can play with.
A More Complete Answer
Now, that said, this answer isn't exactly like the original request, since it matches bare numbers as well. If you need one that definitely only matches numbers preceded or followed by eur, you'll need to duplicate and split the regex, like this:
/(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)/i
This correctly captures all of your original examples above, but won't capture bare numbers.
I have a live example on Regex101 of this form as well.
Here's the same regex matched against the extended dataset you provided; notice that it doesn't match francs, percentages, pounds, or any of the other undesirable values, but correctly extracts every euro.
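One practical consequence of splitting the regex is that the number lands in group 1 or group 3 depending on which alternative matched. A rough Python sketch of reading the right groups follows; the helper name euro_amount is hypothetical, and the sample strings are the ones from the question.

```python
import re

# The split pattern: currency-first alternative, then number-first alternative.
SPLIT = re.compile(
    r"(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?"
    r"|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)",
    re.IGNORECASE,
)

def euro_amount(text):
    """Return (number, multiplier) for the first euro amount, or None (hypothetical helper)."""
    m = SPLIT.search(text)
    if m is None:
        return None
    if m.group(1) is not None:
        # Currency came first: groups 1 and 2 are populated.
        number, multiplier = m.group(1), m.group(2)
    else:
        # Number came first: groups 3 and 4 are populated.
        number, multiplier = m.group(3), m.group(4)
    return number.strip(), multiplier

for s in ["10 million euros", "EURO 5.000", "EUR 100",
          "49 % of eligible costs", "GBP 150 000"]:
    print(repr(s), "->", euro_amount(s))
```

The percentage and the pound amount come back as None, since neither has a eur... word or € next to the number.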
Going Beyond the Question
As suggested by @blhsing, there may be some value in including \b word boundaries so that this doesn't match something like Grandeur 100. Those word-boundary anchors belong before the eur in the regex:
/(?:\beur\w*|€)\s*([0-9\., ]+)\s*(million)?|([0-9\., ]+)\s*(million)?\s*(?:\beur\w*|€)/i
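To see what the \b buys you, here is a small Python comparison of the pattern with and without the boundary. The constant names are mine; the patterns are copied from above.

```python
import re

# Split pattern without word boundaries.
NO_BOUNDARY = re.compile(
    r"(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?"
    r"|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)",
    re.IGNORECASE,
)

# Same pattern with \b placed before each "eur".
WITH_BOUNDARY = re.compile(
    r"(?:\beur\w*|€)\s*([0-9\., ]+)\s*(million)?"
    r"|([0-9\., ]+)\s*(million)?\s*(?:\beur\w*|€)",
    re.IGNORECASE,
)

for text in ["Grandeur 100", "EUR 100", "100euros"]:
    print(repr(text),
          "| without \\b:", bool(NO_BOUNDARY.search(text)),
          "| with \\b:", bool(WITH_BOUNDARY.search(text)))
```

One side effect to check against your own data: a number glued directly to the currency word, such as 100euros, has no word boundary between the 0 and the e, so the \b version rejects it.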
An Odd Special Case
Radu asks why the example above doesn't correctly match this:
ERDF : EUR 2.7 million
Or, more specifically, he wonders why it results in a capture of just a single space. The answer is that the regex engine works from left to right and accepts the first position at which any alternative can match. The space in front of EUR comes before the number, so the engine matches " EUR" through the second (number-first) alternative and stops there, because we've allowed a lone space to be a "number"!
The way to fix this is to require that every "number" at least start with an actual digit; starting with . or , or a space shouldn't be allowed. We can do that by extending the number part like this:
from [0-9\., ]+ (one or more of these digit-like characters)
to [0-9][0-9\., ]* (only a digit, and then zero or more of the other characters)
Thus an extended regex that doesn't mis-capture Radu's other example by being too greedy (and that includes the word boundaries just for the heck of it) is this somewhat ugly-looking beast:
/(?:\beur\w*|€)\s*([0-9][0-9\., ]*)\s*(million)?|([0-9][0-9\., ]*)\s*(million)?\s*(?:\beur\w*|€)/i
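As a rough check of this final form, the Python sketch below runs it over a few lines from the question's dataset and collects every euro amount per line. The function name extract_euro_amounts is hypothetical, and trimming the captured number is my addition, since the number class can still end on a space.

```python
import re

# Final pattern: word boundaries plus "a number must start with a digit".
FINAL = re.compile(
    r"(?:\beur\w*|€)\s*([0-9][0-9\., ]*)\s*(million)?"
    r"|([0-9][0-9\., ]*)\s*(million)?\s*(?:\beur\w*|€)",
    re.IGNORECASE,
)

def extract_euro_amounts(text):
    """Return a list of (number, multiplier) pairs found in text (hypothetical helper)."""
    results = []
    for m in FINAL.finditer(text):
        if m.group(1) is not None:
            number, multiplier = m.group(1), m.group(2)  # currency first
        else:
            number, multiplier = m.group(3), m.group(4)  # number first
        results.append((number.strip(), multiplier))
    return results

lines = [
    "ERDF : EUR 2.7 million",
    "ERDF: 7 305 000 DKR (+/- EUR 974 000) ",
    "FEDER: 830 842 euros (5 450 000 FRF)",
    "4.2 million euros",
    "GBP 150 000",
]
for line in lines:
    print(repr(line), "->", extract_euro_amounts(line))
```

Turning the captured text into an actual numeric value is a separate problem that this sketch leaves alone, because the dataset mixes spaces, dots, and commas as separators.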
Regex Learning
How does this regex work? It uses a few basic pieces that the original didn't:
(?:...), which is a non-capturing group: (?:...) is just like parentheses, and groups things together for precedence, but it doesn't actually capture its contents as part of the output.
?, which makes whatever comes directly before it optional.
Given that knowledge, we can break down the full pattern (copied again below) into its logical chunks:
/(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)/i
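(?:eur\w*|€) first matches the eur... part.
\s* matches optional whitespace.
([0-9\., ]+) captures the number.
\s* matches more optional whitespace.
(million)? optionally captures the word million.
| separates the two alternatives; everything after it handles a number that comes before the currency.
([0-9\., ]+) captures the number.
\s* matches some optional whitespace.
(million)? optionally captures the word million.
\s* matches some more optional whitespace.
(?:eur\w*|€) finally matches the trailing eur... part.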
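If you happen to be using Python, the same breakdown can live inside the pattern itself via the re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments. This is only a sketch of that idea with a name of my choosing (EURO_AMOUNT), not a different regex.

```python
import re

EURO_AMOUNT = re.compile(
    r"""
    (?:eur\w*|€)     # the eur... word or the euro sign, not captured
    \s*              # optional whitespace
    ([0-9\., ]+)     # group 1: the number
    \s*              # optional whitespace
    (million)?       # group 2: optional "million"
  |                  # OR: the number comes first
    ([0-9\., ]+)     # group 3: the number
    \s*              # optional whitespace
    (million)?       # group 4: optional "million"
    \s*              # optional whitespace
    (?:eur\w*|€)     # the trailing eur... word or euro sign
    """,
    re.IGNORECASE | re.VERBOSE,
)

m = EURO_AMOUNT.search("EUR 0.2 million")
print(m.group(1).strip(), m.group(2))

m = EURO_AMOUNT.search("FEOGA: 40 826 EUR")
print(m.group(3).strip(), m.group(4))
```

The space inside [0-9\., ] is kept even in verbose mode because it sits inside a character class, which is why the second sample can still capture a number containing a space.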
Upvotes: 2