Radu Chiriac

Reputation: 1424

regex for numbers placed either before or after a given string

I have a set of strings like:

[
  "ERDF : EUR 2.7 million",
  "ERDF : EUR 961 000",
  "ERDF: 7 305 000 DKR (+/- EUR    974 000) ",
  "FEOGA: 40 826 EUR",
  "49 % of eligible costs",
  "ERDF contribution: 64%",
  "FEDER (Objectif 5b 1994-1996) 60 979 euros (400 000 FRF)",
  "FEDER, Objectif 2, 1994 - 1999: 1 116 000 EUR",
  "EUR 8.000.000",
  "EUR 7.200.000",
  "EUR 4.200.000",
  "4.2 million euros",
  "EUR 0.2 million",
  "EUR 0.6 million",
  "FEDER: 830 842 euros (5 450 000 FRF)",
  "EUR  7,220,000,000",
  "DKR 1 220 000 + DKR 1 380 000 ",
  "GBP 150 000" ]

Online at regex101.com

I would like to capture the numbers (with 'million' if present) that have eur* either as a prefix or as a suffix. The cases below should all match the expression:

10 million euros
EURO 5.000
EUR 100

My current regular expression works only if eur* comes before the number:

/(\beur[a-z]*|€)+[\s\d\,\.|million]*\b/gi

Upvotes: 1

Views: 288

Answers (2)

Nambi_0915

Reputation: 1091

You can also try this:

((?=[\w\s\d\.]+eur)|(?=[\w\s\d\.]+EUR))(eur(os)?|EUR(OS)?|million|\s|\d|\.)+?(?=$|\(|\))


Upvotes: 0

Sean Werkema

Reputation: 6125

Basic Answer

I'd recommend something like

/(?:eur\w*|€)?\s*([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)?/i

This recognizes the number and the "million" as two separate capture groups, and matches each of your given examples:

  • EUR 7.200.000 --> Group 1 = 7.200.000
  • euro 4 000 --> Group 1 = 4 000
  • EUR 0.2 million --> Group 1 = 0.2, Group 2 = million
  • Main project: 300 000 EUR --> Group 1 = 300 000
  • 1998: 43.000.000 euros --> Group 1 = 43.000.000

Here's a live example at Regex101 of the regex that you can play with.
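
If you'd rather try this in code than on Regex101, here's a minimal sketch in Python. Python is my assumption, since the question doesn't name a language, and first_amount is just a hypothetical helper name. Note that the number group is greedy and can swallow a trailing space, so the sketch strips it; the last call shows that this basic form also accepts a bare number.

```python
import re

# The basic pattern from above, written for Python's re module (assumed language).
BASIC = re.compile(
    r"(?:eur\w*|€)?\s*([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)?",
    re.IGNORECASE,
)

def first_amount(text):
    """Hypothetical helper: return (number, million) for the first match, or None."""
    m = BASIC.search(text)
    if m is None:
        return None
    # Group 1 may include a trailing space, so strip it before returning.
    return m.group(1).strip(), m.group(2)

print(first_amount("EUR 7.200.000"))
print(first_amount("EUR 0.2 million"))
print(first_amount("49 % of eligible costs"))  # a bare number still matches
```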

A More Complete Answer

Now, that said, this answer isn't exactly like the original request, since it matches bare numbers as well. If you need one that definitely only matches numbers preceded or followed by eur, you'll need to duplicate and split the regex, like this:

/(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)/i

This correctly captures all of your original examples above, but won't capture bare numbers.

I have a live example on Regex101 of this form as well.

Here's the same regex matched against the extended dataset you provided; notice that it doesn't match francs, percentages, pounds, or any of the other undesirable values, but correctly extracts every euro.
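
One practical wrinkle with the split form is that the number lands in group 1 or group 3 (and "million" in group 2 or group 4), depending on which side of the | matched. Here's a rough Python sketch of how you might fold those back together; Python and the euro_amount helper name are my own assumptions, not something from the question.

```python
import re

# The split pattern from above: prefix form on the left, suffix form on the right.
SPLIT = re.compile(
    r"(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?"
    r"|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)",
    re.IGNORECASE,
)

def euro_amount(text):
    """Hypothetical helper: return (number, million) for the first euro amount, or None."""
    m = SPLIT.search(text)
    if m is None:
        return None
    # Only one alternative matches, so take whichever pair of groups is filled.
    number = m.group(1) or m.group(3)
    million = m.group(2) or m.group(4)
    return number.strip(), million

print(euro_amount("Main project: 300 000 EUR"))
print(euro_amount("4.2 million euros"))
print(euro_amount("GBP 150 000"))  # no eur on either side, so no match
```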

Going Beyond the Question

As suggested by @blhsing, there may be some value in including \b word boundaries so that this doesn't match something like Grandeur 100. Those word-boundary assertions belong before the eur in the regex:

/(?:\beur\w*|€)\s*([0-9\., ]+)\s*(million)?|([0-9\., ]+)\s*(million)?\s*(?:\beur\w*|€)/i
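
To see what the \b buys you, here's a small Python comparison of the two forms (Python is an assumption on my part, and the names LOOSE and BOUNDED are just labels for this sketch).

```python
import re

# Without word boundaries: "eur" may start in the middle of a word.
LOOSE = re.compile(
    r"(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?"
    r"|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)",
    re.IGNORECASE,
)

# With \b before each eur: "eur" must start at a word boundary.
BOUNDED = re.compile(
    r"(?:\beur\w*|€)\s*([0-9\., ]+)\s*(million)?"
    r"|([0-9\., ]+)\s*(million)?\s*(?:\beur\w*|€)",
    re.IGNORECASE,
)

print(LOOSE.search("Grandeur 100"))    # matches the "eur 100" inside "Grandeur 100"
print(BOUNDED.search("Grandeur 100"))  # None: no boundary between "d" and "e"
print(BOUNDED.search("EUR 100"))       # still matches a genuine prefix
```

One side effect worth knowing about: because there's no word boundary between a digit and a letter, the \b form also rejects an amount glued directly to its suffix, such as 100EUR. None of the sample strings are written that way, but it's something to keep in mind if your data is.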

An Odd Special Case

Radu asks why the example above doesn't correctly match this:

ERDF : EUR 2.7 million

Or, more specifically, he wonders why it results in a capture of just a single space. The answer is that a regex engine returns the leftmost match it can find: it scans from left to right and stops at the first position where the pattern succeeds. So as soon as the engine reaches the space before EUR, the second alternative can match that space followed by EUR and report the space as the answer, because we've allowed a space to be a "number"!

The way to fix this is to require that every "number" at least start with an actual digit: starting with ., with ,, or with a space shouldn't be allowed. We can do that by extending the number part like this:

  • [0-9\., ]+ (one or more of these digit-like characters)
    • becomes --> [0-9][0-9\., ]* (exactly one digit, then zero or more of the digit-like characters)

Thus an extended regex that no longer mis-captures Radu's other example (and that includes the word boundaries just for the heck of it) is this somewhat ugly-looking beast:

/(?:\beur\w*|€)\s*([0-9][0-9\., ]*)\s*(million)?|([0-9][0-9\., ]*)\s*(million)?\s*(?:\beur\w*|€)/i
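
Here's a hedged Python sketch of the before and after on Radu's example (Python is assumed, and OLD, FIXED, and euro_amounts are names I've made up for illustration). It uses finditer so that a string holding several euro amounts would return all of them.

```python
import re

# Previous form: the "number" may start with a space, dot, or comma.
OLD = re.compile(
    r"(?:\beur\w*|€)\s*([0-9\., ]+)\s*(million)?"
    r"|([0-9\., ]+)\s*(million)?\s*(?:\beur\w*|€)",
    re.IGNORECASE,
)

# Fixed form: the "number" must start with a digit.
FIXED = re.compile(
    r"(?:\beur\w*|€)\s*([0-9][0-9\., ]*)\s*(million)?"
    r"|([0-9][0-9\., ]*)\s*(million)?\s*(?:\beur\w*|€)",
    re.IGNORECASE,
)

def euro_amounts(text):
    """Hypothetical helper: list every (number, million) pair found in text."""
    results = []
    for m in FIXED.finditer(text):
        number = m.group(1) or m.group(3)
        million = m.group(2) or m.group(4)
        results.append((number.strip(), million))
    return results

text = "ERDF : EUR 2.7 million"
print(repr(OLD.search(text).group(3)))  # the lone space captured as a "number"
print(euro_amounts(text))
print(euro_amounts("ERDF: 7 305 000 DKR (+/- EUR    974 000) "))
```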

Regex Learning

How does this regex work? It uses a few basic pieces that the original didn't:

  • First, it makes extensive use of (?:...), which is a non-capturing group: (?:...) is just like parentheses, and groups things together for precedence, but it doesn't actually capture its contents as part of the output.
  • Some of the versions of this regex also make some of the content optional using ?.

Given that knowledge, we can break down the full pattern (copied again below) into its logical chunks:

/(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)/i
  • On the left-hand side:
    • (?:eur\w*|€) first matches the eur... part.
    • Then \s* matches optional whitespace.
    • Then ([0-9\., ]+) captures the number.
    • There's more optional whitespace: \s*
    • Then finally, we capture an optional "million": (million)?
  • On the right-hand side:
    • First, we match and capture the number: ([0-9\., ]+)
    • Then \s* matches some optional whitespace.
    • Then we capture an optional "million": (million)?
    • Then some more optional whitespace: \s*
    • Finally, we make sure that it's followed by eur...: (?:eur\w*|€)
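
If your regex flavor supports a verbose/extended mode, the breakdown above can live right inside the pattern. Here's a sketch using Python's re.VERBOSE flag (Python is my assumption here; ANNOTATED, COMPACT, and groups are illustrative names). In this mode, whitespace outside a character class is ignored and # starts a comment, while the space inside [0-9\., ] is kept.

```python
import re

ANNOTATED = re.compile(
    r"""
    (?:eur\w*|€)      # left-hand side: the eur... prefix (not captured)
    \s*               # optional whitespace
    ([0-9\., ]+)      # group 1: the number
    \s*               # more optional whitespace
    (million)?        # group 2: an optional "million"
  |
    ([0-9\., ]+)      # right-hand side, group 3: the number
    \s*               # optional whitespace
    (million)?        # group 4: an optional "million"
    \s*               # more optional whitespace
    (?:eur\w*|€)      # the eur... suffix (not captured)
    """,
    re.IGNORECASE | re.VERBOSE,
)

# The same pattern on one line, for comparison.
COMPACT = re.compile(
    r"(?:eur\w*|€)\s*([0-9\., ]+)\s*(million)?|([0-9\., ]+)\s*(million)?\s*(?:eur\w*|€)",
    re.IGNORECASE,
)

def groups(pattern, text):
    """Return the capture groups of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.groups() if m else None

for sample in ["EUR 0.2 million", "4.2 million euros", "GBP 150 000"]:
    print(sample, groups(ANNOTATED, sample), groups(COMPACT, sample))
```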

Upvotes: 2
