Reputation: 49
I want to extract currency amounts with a € sign from text. My text is, e.g.:
"€0.74 million developer fund of €2 billion carbon emission"
My regex is:
"(\u20AC)[0-9]+.[0-9]+\\s(m|b)illion+"
When I run the regex on the text, I get the output below:
[[1]]
character(0)
Can anyone tell me what is wrong with the regex, and why I cannot extract the € sign even after using (\u20AC) for it?
Upvotes: 3
Views: 9492
Reputation: 626960
Your pattern, (€)[0-9]+.[0-9]+\\s(m|b)illion+, does not match all of your strings because the [0-9]+.[0-9]+ part requires at least 2 digits separated by exactly 1 character of any kind. It means you may match 1t6 million or 1.6 billionnnn (several n characters are matched because the final n is quantified, n+), but not €2 billion, where no second digit follows the first.
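To make the behaviour described above concrete, here is a minimal sketch that runs only the number and unit part of the original pattern against a few ASCII test strings. The euro sign is left out on purpose so that encoding does not affect the result; the vector x and its values are made up for illustration.

```r
# Number/unit part of the original pattern, without the euro sign
p <- "[0-9]+.[0-9]+\\s(m|b)illion+"

# Hypothetical test strings (not from the question)
x <- c("0.74 million", "2 billion", "1t6 million", "1.6 billionnnn")

# TRUE where the pattern finds a match
hits <- grepl(p, x)
print(setNames(hits, x))

# The matched text for the strings that do match
print(regmatches(x, regexpr(p, x)))
```

Only "2 billion" fails, because the pattern insists on a second run of digits after the single "any character" position.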
You do not get any matches because you wrote € as \u20AC; otherwise, you would get 1 match, 0.74 million:
> p = "(€)[0-9]+.[0-9]+\\s(m|b)illion+"
> str_extract_all(txt, p)
[[1]]
[1] "\u00800.74 million"
To solve the issue, you may use base R regmatches with gregexpr:
> txt <- "€0.74 million developer fund of €2 billion carbon emission"
> res <- regmatches(txt, gregexpr("€[0-9]+(?:\\.[0-9]+)?\\s*[mb]illion", txt, ignore.case=TRUE))
> lapply(res, cat, "\n")
€0.74 million €2 billion
[[1]]
NULL
Note that I used cat to display the Unicode string results, as these are the actual extracted values.
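As a hedged variation on the code above, the sketch below writes the euro sign as the \u20AC escape in both the text and the pattern, so the script itself stays ASCII-only, and then checks the extracted values programmatically instead of relying on how the console prints them. The perl = TRUE argument and the variable names are my own choices, not part of the original answer, and the result may still depend on your R version and locale.

```r
# Same text as above, with the euro sign written as a Unicode escape
txt <- "\u20AC0.74 million developer fund of \u20AC2 billion carbon emission"

# Euro sign, digits, optional decimal part, optional whitespace, million/billion
pattern <- "\u20AC[0-9]+(?:\\.[0-9]+)?\\s*[mb]illion"

# perl = TRUE is an assumption made here; the original answer uses the default engine
res <- regmatches(txt, gregexpr(pattern, txt, ignore.case = TRUE, perl = TRUE))[[1]]

# Check the values directly rather than trusting the printed form
n_found <- length(res)
starts_with_euro <- startsWith(res, "\u20AC")

# writeLines, like cat, shows the actual characters instead of escapes
writeLines(res)
```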
Pattern details
€ - a euro sign
[0-9]+ - 1 or more digits
(?:\\.[0-9]+)? - 1 or 0 occurrences of a . and then 1 or more digits
\\s* - zero or more whitespaces
[mb] - m or b
illion - a literal substring.
Upvotes: 1
Reputation: 4378
Using stringr. In your regex, \s should be \\s. The code below uses
\\d for digits (just simpler than [0-9])
(\\.\\d+)? for an optional decimal part - everything in the parentheses before ? is optional
s <- "€0.74 million developer fund of €2 billion carbon emission"
r <- "(\u20AC)\\d+(\\.\\d+)?\\s(m|b)illion"
library(stringr)
str_extract_all(s,r)
# [1] "€0.74 million" "€2 billion"
Upvotes: 3
Reputation: 1340
Try a different code for the euro sign, such as \x80 (the byte for € in Windows-1252):
((\x80)[0-9]+.[0-9]+\\s(m|b)illion).*
This captures the euro sign properly. I used gsub, by the way:
z = "€0.74 million developer fund of €2 billion carbon emission"
gsub("((\x80)[0-9]+.[0-9]+\\s(m|b)illion).*","\\1",z)
However, this only catches the first amount for now, but I think that is easily solvable.
Upvotes: 1