Naveen Goud

Reputation: 49

Regex for euro sign (€)

I want to extract currency amounts with a € sign from text. For example, my text is:

"€0.74 million developer fund  of €2 billion carbon emission"

my regex is:

"(\u20AC)[0-9]+.[0-9]+\\s(m|b)illion+" 

When I run the regex on the text, I get the output below:

[[1]]
character(0)

Can anyone tell me what is wrong with the regex, and why I am not able to extract the € sign even after using (\u20AC) for it?

Upvotes: 3

Views: 9492

Answers (3)

Wiktor Stribiżew

Reputation: 626960

Your pattern, (€)[0-9]+.[0-9]+\\s(m|b)illion+, does not match every amount because the [0-9]+.[0-9]+ part requires at least 2 digits "split" by exactly 1 arbitrary character (the unescaped . matches any character). This means it matches 1t6 million or 1.6 billionnnn (several ns are matched because of the quantified n, n+), but not €2 billion.
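
As a rough check of the behaviour described above, here is a minimal sketch that runs the original pattern against a few strings. The test strings are made up for illustration, and the euro sign is written as the escape \u20AC so that the snippet does not depend on the encoding of the script file.

```r
# Euro sign as a Unicode escape, so the script's file encoding does not matter
euro <- "\u20AC"

# The pattern from the question: unescaped dot and a quantified trailing n
original <- paste0(euro, "[0-9]+.[0-9]+\\s(m|b)illion+")

# Hypothetical test strings (not from the question)
tests <- paste0(euro, c("0.74 million", "2 billion", "1t6 million", "1.6 billionnnn"))

# TRUE where the pattern finds a match somewhere in the string
hits <- grepl(original, tests)
print(setNames(hits, tests))
```

Only the second string, the one with a single integer amount, should come back FALSE.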

You get no matches at all because you wrote € as \u20AC; with a literal € you would get 1 match, €0.74 million:

> p = "(€)[0-9]+.[0-9]+\\s(m|b)illion+"
> str_extract_all(txt, p)
[[1]]
[1] "\u00800.74 million"

To solve the issue, you may use a base R regmatches with gregexpr:

> txt <- "€0.74 million developer fund  of €2 billion carbon emission"
> res <- regmatches(txt, gregexpr("€[0-9]+(?:\\.[0-9]+)?\\s*[mb]illion", txt, ignore.case=TRUE))
> lapply(res, cat, "\n")
€0.74 million €2 billion 
[[1]]
NULL

Note that I used cat to display the results, so the Unicode strings are printed as the actual extracted values.

Pattern details

  • € - a euro sign
  • [0-9]+ - 1 or more digits
  • (?:\\.[0-9]+)? - 1 or 0 occurrences of a . and then 1 or more digits
  • \\s* - zero or more whitespaces
  • [mb] - m or b
  • illion - a literal substring.
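
To see each part of the pattern at work, here is a small sketch under a few assumptions: extract_amounts is a hypothetical helper name, the second input string is invented, and perl = TRUE is added so that the (?:...) group is handled by the PCRE engine.

```r
euro <- "\u20AC"
pattern <- paste0(euro, "[0-9]+(?:\\.[0-9]+)?\\s*[mb]illion")

# Hypothetical helper: return every match found in a single string
extract_amounts <- function(x) {
  regmatches(x, gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE))[[1]]
}

# The text from the question
txt <- paste0(euro, "0.74 million developer fund  of ", euro, "2 billion carbon emission")
found <- extract_amounts(txt)
print(found)

# Invented variants: no whitespace, upper-case unit, and a non-numeric amount
variants <- extract_amounts(
  paste0(euro, "5million, ", euro, "1.5 Billion, ", euro, "1t6 million")
)
print(variants)
```

The first call should return both amounts from the question. In the second call, the amount with no space and the one with a capital B should match, while 1t6 million should not, because the decimal separator is now an escaped dot.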

Upvotes: 1

Andrew Lavers

Reputation: 4378

Using stringr. In your regex, \s should be \\s. The pattern below uses

\\d for digits (just simpler than [0-9])

(\\.\\d+)? for an optional decimal part: everything in the parentheses before ? is optional

s <- "€0.74 million developer fund of €2 billion carbon emission"
r <- "(\u20AC)\\d+(\\.\\d+)?\\s(m|b)illion"
library(stringr)
str_extract_all(s, r)

# [1] "€0.74 million" "€2 billion"

Upvotes: 3

zwep

Reputation: 1340

Try a different character code for the euro sign, for example \x80 (its byte in Windows-1252):

((\x80)[0-9]+.[0-9]+\\s(m|b)illion).*

This captures the euro sign when the text is in that encoding.

I used gsub, by the way:

z = "€0.74 million developer fund  of €2 billion carbon emission"
gsub("((\x80)[0-9]+.[0-9]+\\s(m|b)illion).*","\\1",z)

However, this only returns the first match, though I think that is easily solvable.
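
For context on why \x80 and \u20AC can both stand for the euro sign, here is a minimal sketch that prints the character's code point and its bytes in two encodings. It assumes that iconv on your platform recognises the encoding name "CP1252"; if it does not, the last value will be NULL.

```r
euro <- "\u20AC"

# Unicode code point of the euro sign (U+20AC)
code_point <- utf8ToInt(euro)

# Bytes of the euro sign when stored as UTF-8
utf8_bytes <- charToRaw(enc2utf8(euro))

# Bytes after converting to Windows-1252 (encoding name is an assumption)
cp1252_bytes <- iconv(euro, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]

print(code_point)
print(utf8_bytes)
print(cp1252_bytes)
```

In UTF-8 the sign takes three bytes, while in Windows-1252 it should be the single byte 80, which is why \x80 only matches when the text is in that single-byte encoding.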

Upvotes: 1
