Reputation: 1819
In R, grep usually matches a vector of multiple strings against one regexp.
Q: Is it possible to match a single string against multiple regexps without looping over each pattern?
Some background:
I have 7000+ keywords that serve as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in column 1; the numbers indicate the categories each keyword belongs to):
ab 10 37 41
abbrach* 38
abbreche 39
abbrich* 39
abend* 37
abendessen* 60 63
aber 20 23 45
abermals 37
Concatenating so many keywords with "|" is not a feasible way (and I wouldn't know which of the keywords generated the hit). Also, just reversing "patterns" and "strings" does not work, as the patterns have truncations, which wouldn't work the other way round.
[related question, other programming language]
Upvotes: 30
Views: 23526
Reputation: 9423
The re2r package can match multiple patterns (in parallel). Minimal example:
# compile patterns
re <- re2r::re2(keywords)
# match strings
re2r::re2_detect(strings, re, parallel = TRUE)
Upvotes: 3
Reputation: 2988
To expand on the other answer: to transform the sapply() output into a useful logical vector, you need a further apply() step.
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
(matches <- sapply(keywords, grepl, strings, ignore.case=TRUE))
#        dog   cat  bird
# [1,]  TRUE FALSE FALSE
# [2,] FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE
To know which strings contain any of the keywords (patterns):
apply(matches, 1, any)
# [1] TRUE TRUE FALSE
To know which keywords (patterns) were matched in the supplied strings:
apply(matches, 2, any)
#  dog  cat bird
# TRUE TRUE TRUE
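To see which keywords matched each individual string (rather than only whether any did), the same logical matrix can be indexed row by row. This is a minimal sketch, not the only way to do it; the name `hits` is made up for illustration, and it assumes at least two strings, because sapply() returns a plain vector instead of a matrix when given a single string.

```r
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

# Rows correspond to strings, columns to keywords
matches <- sapply(keywords, grepl, strings, ignore.case = TRUE)

# For each string (row), keep the keywords whose column is TRUE
hits <- lapply(seq_len(nrow(matches)), function(i) keywords[matches[i, ]])
names(hits) <- strings

hits
```

Each list element holds the keywords found in that string; a string without any match gets an empty character vector.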
Upvotes: 2
Reputation: 2285
What about applying the regexpr function over a vector of keywords?
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
sapply(keywords, regexpr, strings, ignore.case=TRUE)
     dog cat bird
[1,]  15  -1   -1
[2,]  -1   4   15
[3,]  -1  -1   -1
sapply(keywords, regexpr, strings[1], ignore.case=TRUE)
 dog  cat bird
  15   -1   -1
Values returned are the position of the first character in the match, with -1 meaning no match.
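If the matched text itself is of interest (for example, to see which full word a truncated pattern hit), the regexpr() result can be passed to regmatches(). A rough sketch for a single string; the names `string` and `found` are only illustrative:

```r
keywords <- c("dog", "cat", "bird")
string <- "My cat ate my bird."

# One regexpr() result per keyword, all against the same string
m <- lapply(keywords, regexpr, text = string, ignore.case = TRUE)

# regmatches() returns character(0) when there is no match, so map that to NA
found <- vapply(m, function(pos) {
  hit <- regmatches(string, pos)
  if (length(hit) == 0) NA_character_ else hit
}, character(1))
names(found) <- keywords

found
```

Keywords without a match come back as NA; the others return the substring that was actually matched.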
If the position of the match is irrelevant, use grepl instead:
sapply(keywords, grepl, strings, ignore.case=TRUE)
       dog   cat  bird
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE
Update: This runs relatively quickly on my system, even with a large number of keywords:
# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936
system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))
   user  system elapsed
  7.495   0.155   7.596
dim(matches)
[1] 3 234936
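To tie this back to the dictionary format in the question, here is a rough sketch of how the keyword lines might be turned into regexps and how the categories of the matching keywords could be looked up. It rests on assumptions not confirmed by the question: that a trailing `*` means "any word characters may follow", that keywords should match whole words only, and that keywords contain no other regex metacharacters. The names `dict_lines`, `patterns`, `text` and `cats` are made up for illustration.

```r
# A few lines in the dictionary format from the question
dict_lines <- c("ab 10 37 41", "abbrach* 38", "abend* 37", "aber 20 23 45")

# First token is the keyword, the remaining tokens are category numbers
parts <- strsplit(dict_lines, "[[:space:]]+")
keywords <- vapply(parts, `[`, character(1), 1)
categories <- lapply(parts, function(p) as.integer(p[-1]))
names(categories) <- keywords

# Assumption: a trailing "*" is a truncation wildcard, translated to "\\w*";
# every keyword is anchored at word boundaries
stem <- sub("\\*$", "", keywords)
patterns <- paste0("\\b", stem, ifelse(grepl("\\*$", keywords), "\\w*", ""), "\\b")
names(patterns) <- keywords

text <- "Aber am Abend gab es einen Abbruch"

# One grepl() call per pattern against the single string
hit <- vapply(patterns, grepl, logical(1), x = text,
              ignore.case = TRUE, perl = TRUE)

matched <- keywords[hit]
cats <- sort(unique(unlist(categories[matched], use.names = FALSE)))

matched
cats
```

Because the patterns stay separate, it remains clear which keyword produced each hit, which a single "|"-joined regexp would not tell you. Note that with perl = TRUE, `\\w` may not cover non-ASCII letters such as umlauts unless the pattern is adapted.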
Upvotes: 34