Reputation: 3542
I'm learning regex matching in R using the stringr package, but I don't understand why
str_match("1,000,222.333 /month", "[\\d,]*\\.?\\d*")
[,1]
[1,] "1,000,222.333"
returns the desired result, while
str_match("about $1,000,222.33 em's", "[\\d,]*\\.?\\d*")
[,1]
[1,] ""
returns an empty string? Is something wrong with my [\\d,]*?
I've learned that matching numbers with regex is complicated, so this snippet is not supposed to be used in production. I just want to understand why it fails in this specific case.
Upvotes: 2
Views: 321
Reputation: 70722
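To elaborate, the problem is the * operator. It lets the regular expression engine match zero or more characters, so [\d,]* matches zero or more digits or literal commas, which might be none at all. Because \.? and \d* are optional too, the whole pattern can match the empty string. The engine starts at the first character, "a", finds no digit or comma there, and reports a successful zero-length match instead of scanning ahead to the number. In your first example the string begins with a digit, so the match is not empty. I would write this as follows: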
str_match(x, '[\\d,]+(?:\\.\\d+)?')
Or make effective use of rm_number (a regex I wrote for this) from the qdapRegex package:
library(qdapRegex)
x <- c("about $1,000,222.33 em's", "1,000,222.333 /month")
rm_number(x, extract=TRUE)
# [[1]]
# [1] "1,000,222.33"
# [[2]]
# [1] "1,000,222.333"
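To see the zero-length match for yourself, here is a minimal sketch in base R rather than stringr, assuming perl = TRUE so that \d is understood. The variable names (star, plus, and so on) are only illustrative. It reports where the first match starts and how long it is for the original pattern and for the + version.

```r
# Base R sketch; perl = TRUE enables \d. No packages needed.
x <- "about $1,000,222.33 em's"

# Every part of this pattern is optional, so it can match zero characters.
star <- regexpr("[\\d,]*\\.?\\d*", x, perl = TRUE)
star_start <- as.integer(star)              # position of the first match
star_len   <- attr(star, "match.length")    # number of characters matched
star_text  <- regmatches(x, star)

# Requiring at least one digit or comma makes the engine scan forward.
plus <- regexpr("[\\d,]+(?:\\.\\d+)?", x, perl = TRUE)
plus_start <- as.integer(plus)
plus_text  <- regmatches(x, plus)

print(c(star_start, star_len))  # start 1, length 0: an empty match at "a"
print(c(plus_start, nchar(plus_text)))
print(plus_text)
```

The first pattern succeeds immediately at position 1 with length 0, which is why str_match shows "". The second cannot succeed until it reaches the first digit.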
Upvotes: 2
Reputation: 886948
You could use + to match one or more characters rather than *, which matches 0 or more.
library(stringr)
v1 <- c("about $1,000,222.33 em's", "1,000,222.333 /month")
str_match(v1, "[\\d,]+\\.?\\d*")
# [,1]
#[1,] "1,000,222.33"
#[2,] "1,000,222.333"
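As a side note, this pattern and the [\\d,]+(?:\\.\\d+)? pattern from the other answer differ slightly when a number is followed by a sentence-ending period. The sketch below uses base R with perl = TRUE instead of stringr, and the sample sentence is made up for illustration.

```r
# Hypothetical sample text: a number followed directly by a full stop.
x <- "It cost 1,000. Then 2.5 more."

# \\.? accepts a dot even when no digits follow it.
loose <- regmatches(x, regexpr("[\\d,]+\\.?\\d*", x, perl = TRUE))

# (?:\\.\\d+)? accepts a dot only when at least one digit follows.
strict <- regmatches(x, regexpr("[\\d,]+(?:\\.\\d+)?", x, perl = TRUE))

print(loose)   # includes the trailing dot
print(strict)  # stops before the dot
```

Neither pattern validates comma placement, so both would still accept a lone ",". That is consistent with the asker's note that this is not meant for production.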
Upvotes: 3