rstringperformancesubstringpattern-matching

Reputation: 651

A more efficient way to check which substrings are present in several test strings?

I would like to write an R function that takes a string as input, checks if several thousand substrings are present in that string, and returns a vector of substrings found in the tested string.

I wrote code to do this, but it would be unacceptably slow if I had to call the function many times for 10,000-20,000 different values of "tested_string":

# The function
# This function would actually do many other things to the tested string too, but for now I'm just testing the slow step
check_substrings <- function(tested_string,substrings) {
  result <- sapply(substrings,function(substring,tested_string) { return(any(grepl(substring,tested_string,fixed=T)))},tested_string=tested_string)
  return(names(result)[result])
}

# Testing speed
library('stringi') # Used for generating random strings for testing

# Setting up to test it
# Make random substrings of varying length
set.seed(5)
substrings <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),
                stri_rand_strings(30000,7,pattern='[A-Z]'),
                stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings so they won't be part of the timing below
set.seed(5)
teststrings <- unique(stri_rand_strings(100,20,pattern='[A-Z]'))
teststrings_1k <- stri_rand_strings(1000,20,pattern='[A-Z]')

# Time how long it takes to check 100 tested strings this way
# (I'll actually need to do 10,000 to 20,000 tested strings)
system.time(
  for(tstring in teststrings) {
    x <- check_substrings(tstring,substrings) 
  }
)
# user  system elapsed 
# 12.457   0.046  12.499
# At a rate of 12.4 seconds per 100 test strings, it would take 41.3 minutes to do 20k test strings

I know there are several functions like grepl() and stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern). Turning the problem sideways and checking all possible tested strings for each pattern, one pattern at a time speeds things up considerably:

system.time(
  {
    # Initial check for which substrings are in which test strings
    res_matrix <- sapply(substrings,grepl,x=teststrings_1k,fixed=T)
    # Turn result into a list of which substrings are in which test strings
    rownames(res_matrix) <- teststrings_1k
    res_list <- apply(res_matrix,1,function(x) { return(names(x)[x])})
  }
)
# user  system elapsed 
# 3.641   0.227   3.904
# This is much better - at a rate of 3.9 seconds per 1000 test strings it will take
# take 1.3 minutes to do 20,000 test strings

I can use this second approach if I have to, but it would be a somewhat messy solution because I would have to pre-do this for all substrings and test strings together ahead of time (rather than letting the function that does a bunch of things to each particular test string do this check for that string on the fly).

Is there a better/more efficient way to do something like my check_substrings() function in the first example (check one string for multiple substrings), or am I better off sticking with the second example despite the other complications it would cause?

Upvotes: 5

Answers (2)

Tim G

Reputation: 4147

I wrote a cpp-function that calculates a substring matrix in 5.3 seconds for 20.000 values of "testedstrings" and another string-vector which you called "substrings".

Under the hood, stri_extract_first_fixed() executes a C++ function which makes it so fast. However, there is some overhead to this function which I guess comes from various checks. Also it checks only one string for a pattern vector which leads us to use sapply. We can write a much faster C++ function which does the same but without the need for sapply. We use OpenMP parallelization and calculate a logical matrix directly in C++.

How my function works

> strings <- c("apple", "banana", "cherry", "pineapple")
> patterns <- c("an", "apple", "berry")
> res <- string_detect_multiple(strings, patterns)
> rownames(res) <- patterns
> colnames(res) <- strings
> res
      apple banana cherry pineapple
an    FALSE   TRUE  FALSE     FALSE
apple  TRUE  FALSE  FALSE      TRUE
berry FALSE  FALSE  FALSE     FALSE

In your R-directory create a file "string_detect_multiple.cpp":

string_detect_multiple.cpp

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
LogicalMatrix string_detect_multiple(CharacterVector strings, CharacterVector patterns) {
  const int n_strings = strings.size();
  const int n_patterns = patterns.size();
  LogicalMatrix results(n_patterns, n_strings);
  
  // Pre-convert patterns to std::string to avoid repeated conversion
  std::vector<std::string> pattern_strings(n_patterns);
  for (int i = 0; i < n_patterns; ++i) {
    pattern_strings[i] = std::string(patterns[i]);
  }
  
  // Pre-convert strings to avoid repeated conversion
  std::vector<std::string> str_vec(n_strings);
  for (int j = 0; j < n_strings; ++j) {
    str_vec[j] = std::string(strings[j]);
  }
  
#pragma omp parallel for collapse(2)
  for (int i = 0; i < n_patterns; ++i) {
    for (int j = 0; j < n_strings; ++j) {
      results(i, j) = str_vec[j].find(pattern_strings[i]) != std::string::npos;
    }
  }
  
  return results;
}

Import faster function to R

We then use this function in R making use of the package Rccp. Create this R-Script in the same folder and load in the CPP-Script like shown:

library(stringi)
library(microbenchmark)
setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # set the current script's location as working directory

# Test data
set.seed(5)
pat2 <- unique(c(stri_rand_strings(20000,6,pattern='[A-Z]'),stri_rand_strings(30000,7,pattern='[A-Z]'), stri_rand_strings(40000,8,pattern='[A-Z]')))
# Pre-generate random tested strings 
test_string <- "FRXHCSNVYCHM" 
str1 <- c("FRXHCSNVYCHM",stri_rand_strings(1000,20,pattern='[A-Z]'))


# get matrix with stri_detect_fixed
res_sdf  <-  sapply(str1, stri_detect_fixed, pat2)
# rownames(res_sdf_m) <- pat2 # optional name rows
# import our own C++ function :)
#install.packages("Rcpp")
library(Rcpp)
Sys.setenv("PKG_CXXFLAGS" = "-fopenmp")
Sys.setenv("PKG_LIBS" = "-fopenmp")
sourceCpp("string_detect_multiple.cpp")
# use

res_cpp <- string_detect_multiple(str1, pat2)
colnames(res_cpp) <- str1 # we don't really count that into the function time!! :)
# rownames(res_cpp) <- pat2 # optional name rows

identical(res_cpp, res_sdf)
# benchmark

bm <- microbenchmark::microbenchmark(
  stringi = sapply(str1, stri_detect_fixed, pat2),
  cpp = string_detect_multiple(str1, pat2),
  times = 5
)
bm
boxplot(bm)

Benchmark Results

Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval cld
 stringi 6543.8222 6609.0601 6627.8566 6623.9730 6675.8684 6686.5595     5  a 
     cpp  244.3382  244.6394  266.1575  251.5274  270.3496  319.9331     5   b

As you can see, that makes it 25 times faster than sapply(str1, stri_detect_fixed, pat2).

Single Test-String

Of course, you can use my function to mimic the behavior of your check_substrings() but 112 times as fast and 2.677906 times faster than stringi:

check_subsstrings(test_string, pat2) # run your function
pat2[string_detect_multiple(test_string, pat2)] # run mine

Upvotes: 4

margusl

Reputation: 17544

/../ stri_detect() that can solve the opposite problem (check a vector of multiple strings for a single pattern)/../

Recycling in stringi works both ways:

# many strings - one pattern
stringi::stri_detect_fixed(str = c("ABC", "BCD", "CDE"), pattern = "A")
#> [1]  TRUE FALSE FALSE

# one string - many patterns
stringi::stri_detect_fixed(str = "ABC", pattern = c("A", "B", "D"))
#> [1]  TRUE  TRUE FALSE

So you could rewrite that check with stri_extract_first_fixed() or stri_detect_fixed() + subsetting (tiny bit faster than omitting NA-values from stri_extract results) to look something like this:

library(stringi)
check_substrings_stringi <- \(tested_string, substrings) substrings[stri_detect_fixed(tested_string, substrings)]

Single tested_string and absolute time values:

bench::mark(
  check_substrings(teststrings[1], substrings),
  check_substrings_stringi(teststrings[1], substrings),
  iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 check_substrings(teststrings[1… 226.98ms 236.64ms      4.15    3.75MB     26.1
#> 2 check_substrings_stringi(tests…   4.36ms   4.44ms    210.    709.75KB      0

Full sweep over teststrings and relative execution times:

bench::mark(
  check_substrings         = lapply(teststrings, check_substrings, substrings = substrings),
  check_substrings_stringi = lapply(teststrings, check_substrings_stringi, substrings = substrings),
  relative = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                 min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 check_substrings          40.8   40.8       1        5.46     5.51
#> 2 check_substrings_stringi   1      1        40.8      1        1