Reputation: 740
I'm looking for the most efficient solution to the problem posed in this question: Say you have a vector v
of strings:
set.seed(314159)
library(stringi)
library(stringr)
v <- stringi::stri_rand_strings(10000, 4, pattern = "[A-Z]")
head(v)
#> [1] "FQGK" "YNQH" "IMNJ" "WUFU" "BBAR" "BZUH"
I want to efficiently return a single logical representing whether a given pattern, say "FOO"
, matches any of the strings in v
. A prospective function would work like this:
detect("FOO")
#> FALSE
detect("BAR")
#> TRUE
There are several methods to do this with base grep
functions or using stringr::str_detect
, but these each involve first matching a regex on every element of v
, doing up to 9,999 unnecessary tests in my example. An efficient solution would stop evaluation after a single match is found.
For each solution detect.#
, I benchmark it by applying it to all three letter combinations c
:
c <- combn(LETTERS,3, FUN = function(x){paste(x, collapse = '')})
head(c)
#> [1] "ABC" "ABD" "ABE" "ABF" "ABG" "ABH"
There are a few possible solutions that I have come up with. To start, looping over v
so that unnecessary pattern matching isn't done after a match is found. As you'll see, this is a terrible idea with lots of overhead:
detect.1 <- function(pattern){
for (i in 1:length(v)){
if (length(grep(pattern, v[i]))){return(TRUE)}
}
return(FALSE)
}
Next, we could use combinations of any()
and grepl()
or stringr::str_detect()
, but then we do unnecessary match tests:
#str_detect() from stringr
detect.2 <- function(pattern){
any(str_detect(v, pattern) )
}
# any() and grepl()
detect.3 <- function(pattern){
any(grepl(pattern, v))
}
Finally, if we know a character never appears in pattern
, we can collapse v
into a single string with components separated by this character. Then a single grep
would suffice:
#collapse to long string
v_pasted <- paste(v, collapse = '_')
detect.4 <- function(pattern){
isTRUE(as.logical(grep(pattern, v_pasted)))
}
(updated to use bench::mark()
)
det1 <- expression(data.frame(c, "inV" = I(lapply(c, FUN = detect.1))))
det2 <- expression(data.frame(c, "inV" = I(lapply(c, FUN = detect.2))))
det3 <- expression(data.frame(c, "inV" = I(lapply(c, FUN = detect.3))))
det4 <- expression({
v_pasted <- paste(v, collapse = '_')
data.frame(c, "inV" = I(lapply(c, FUN = detect.4)))
})
bench::mark(
eval(det1),
eval(det2),
eval(det3),
eval(det4),
iterations = 5,
relative = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 eval(det1) 76.9 77.0 76.8 77.2 1 1 Inf 1
#> 2 eval(det2) 4.02 4.03 4.04 4.05 19.1 735. Inf 1
#> 3 eval(det3) 2.77 2.79 2.79 2.80 27.6 735. Inf 1
#> 4 eval(det4) 1 1 1 1 77.0 1.22 NaN 1
grepl
is noticeably faster than str_detect
. The pasting method is fastest, but requires you to have a separation character that doesn't appear in possible search patterns. Is there some faster alternative that I am missing?
Upvotes: 2
Views: 2129
Reputation: 3650
This function from stringi
package should be faster:
any(stri_detect_fixed(v, pattern, max_count = 1))
bench:
require(stringi)
detect.m <- function(pattern){
any(stri_detect_fixed(v, pattern, max_count = 1))
}
detm <- expression(data.frame(c, "inV" = I(lapply(c, FUN = detect.m))))
r <- bench::mark(
# eval(det1),
eval(det2),
eval(det3),
eval(det4),
eval(detm),
iterations = 5,
relative = TRUE
)
r[, 1:10]
# expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 eval(det2) 4.83 5.39 5.02 5.94 1 600. 9 1 5.39
# 2 eval(det3) 3.85 3.69 3.80 3.31 1.46 600. 10 1 3.69
# 3 eval(det4) 1.35 1.32 1.36 1.20 4.08 1 1 1 1.32
# 4 eval(detm) 1 1 1 1 5.39 600. 9 1 1
# lets create larger test case for better comparison:
a <- expand.grid(lapply(1:5, function(x) LETTERS))
a <- do.call(paste0, a)
f10 <- a[10] # lets search for 10th element
last <- a[length(a)] # and last
length(a)
length(unique(a))
v <- a
detm <- function(pattern){
any(stri_detect_fixed(v, pattern, max_count = 1))
}
det4 <- function(pattern){
# should include paste
v_pasted <- paste(v, collapse = '_')
# isTRUE(as.logical(grep(pattern, v_pasted)))
isTRUE(grepl(pattern, v_pasted, fixed = T)) # faster
}
system.time(detm(last)) # 0.74
system.time(detm(f10)) # 0.33
system.time(det4(last)) # 3.38
system.time(det4(f10)) # 3.08
Upvotes: 3