Brian

Reputation: 8275

Efficiently filter strings based on a set of substrings

My general question is how to filter an arbitrary set of strings against an arbitrary set of possible substrings.

My particular problem is that I have long lists of filenames that contain UUIDs, and I have a list of other UUIDs that I want to filter out from those. My datasets can vary in terms of length and what proportion of them end up hitting a match, from n = 10 to 1,000,000, and hits from 0% to 100%. I don't think the match rate is a big factor in the performance of my approach, but I know that size does matter for it.

My naive approach is below. Essentially, from a vector of UUIDs, I concatenate them all together with a | to make a single long regex and test every member of the original data against it. This is very fast for small data, but rapidly gets out of hand.

Setup:

# create a master population to choose from
uuid_list <- uuid::UUIDgenerate(n = 1e5)

filter_from_string <- function(n) {
  
  # choose a representative population
  uuid_population <- sample(uuid_list, n)
  
  # generate the longer strings
  data_strings <- 
    paste0(
      "Here_are_some_strings_with_associated_UUIDs_",
      uuid_population,
      "_and_additional_metadata.json"
    )
  
  # generate a small sample to check against
  uuid_sample <- sample(uuid_population, n/10)
  
  # check against a single long regex that looks like:
  #   "uuid1|uuid2|uuid3|..."
  filter_index <- 
    stringr::str_detect(
      data_strings,
      paste(uuid_sample, collapse = "|")
    )
  
  data_strings[!filter_index]
}

Test:

x <- 
  microbenchmark::microbenchmark(
    "n=10" = filter_from_string(10),
    "n=100" = filter_from_string(100),
    "n=1000" = filter_from_string(1000),
    "n=10000" = filter_from_string(10000),
    times = 10
  )
#> Warning in microbenchmark::microbenchmark(`n=10` = filter_from_string(10), :
#> less accurate nanosecond times to avoid potential integer overflows

# milliseconds
summary(x, "ms")[c("expr", "mean", "median")]
#>      expr         mean       median
#> 1    n=10    0.4201393    0.0713195
#> 2   n=100    1.4376527    0.4230585
#> 3  n=1000   25.4073679   25.8098075
#> 4 n=10000 2340.1810916 2313.1806605

# relative time
summary(x, "relative")[c("expr", "mean", "median")]
#>      expr        mean       median
#> 1    n=10    1.000000     1.000000
#> 2   n=100    3.421848     5.931877
#> 3  n=1000   60.473676   361.889911
#> 4 n=10000 5570.012354 32434.056051

Created on 2023-05-03 with reprex v2.0.2

As you can see, the runtime grows by much more than 10x at each 10x step in n.

I have a fix that speeds this up in my particular use case: because the UUIDs follow a known pattern, I can extract them from the strings and then do an exact equality match against the sample. I'll post that as an answer below, but I'm curious what to do when there's no obvious pattern to exploit, just an arbitrary population of strings and possible substrings.

Upvotes: 1

Views: 96

Answers (2)

Brian

Reputation: 8275

Here is a general approach that doesn't rely on the questionable long-regex technique or on having a pre-defined pattern for the substrings: loop over each substring to generate a logical index, then bind the indices together and use rowSums to get a final match index.

Like the long-regex technique in the question, this has good performance for small n, but spirals out of control between 1000 and 10000.

This approach could be sped up with parallel processing by swapping the loop for a {future}-aware equivalent, for example with the {furrr} package.
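
For illustration, here is a minimal sketch of that parallel variant. Treat it as an untested sketch under assumptions: filter_with_furrr() is a hypothetical name, its inputs mirror the ones built inside filter_with_loop() below, and it assumes the {future} and {furrr} packages are installed.

library(future)
library(furrr)

plan(multisession)  # spread the work across local background R sessions

filter_with_furrr <- function(data_strings, uuid_sample) {
  
  # run the per-substring detection in parallel instead of with lapply()
  match_indices <- 
    furrr::future_map(
      uuid_sample,
      function(s) { stringr::str_detect(data_strings, s) }
    )
  
  # a string should be dropped if any of the substrings matched it
  match_idx <- Reduce(`|`, match_indices)
  
  data_strings[!match_idx]
}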

Alternative attempts:

  • I tried this with str_detect(string, fixed(target)) and found that it didn't meaningfully change the performance.
  • I tried another method of generating the index matrix with outer(strings, targets, ...) and it was substantially slower at all n.
  • Pasting all the population strings into one long string with padding, so that you can easily index into the "document" once you find a match, vectorized over the target strings. This was even worse than the naive | approach.
  • Modifying the naive | approach to first screen with only the first chunk of each UUID, str_sub(uuid_sample, 1, 8), and then running the full target strings only against the subset of possible matches. This doubled the speed of the original method (a rough sketch of the two-stage idea follows this list).
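
Here is a rough sketch of that two-stage prefilter, again under assumptions: filter_with_prefilter() is a hypothetical name, its inputs mirror the ones built inside the setup functions in this post, and the timing quoted in the bullet comes from a variant along these lines rather than this exact code.

filter_with_prefilter <- function(data_strings, uuid_sample) {
  
  # stage 1: cheap screen using only the first 8-character chunk of each UUID
  prefix_regex <- paste(stringr::str_sub(uuid_sample, 1, 8), collapse = "|")
  maybe_match  <- stringr::str_detect(data_strings, prefix_regex)
  
  # stage 2: run the full long regex only on the candidate matches
  full_regex <- paste(uuid_sample, collapse = "|")
  confirmed  <- stringr::str_detect(data_strings[maybe_match], full_regex)
  
  # drop only the confirmed hits
  drop <- logical(length(data_strings))
  drop[which(maybe_match)[confirmed]] <- TRUE
  data_strings[!drop]
}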

Ultimately it seems like the best method really depends on the size and content of both the population of strings and the targets, so there's no "correct approach" to the general problem in R.

Setup:

# create a master population to choose from
uuid_list <- uuid::UUIDgenerate(n = 1e5)

filter_with_loop <- function(n) {
  
  # choose a representative population
  uuid_population <- sample(uuid_list, n)
  
  # generate the longer strings
  data_strings <- paste0(
    "Here_are_some_strings_with_associated_UUIDs_",
    uuid_population,
    "_and_additional_metadata.json"
  )
  
  # generate a small sample to check against
  uuid_sample <- sample(uuid_population, n/10)
  
  # for each target substring, test all the main population
  match_indices <- 
    lapply(
      uuid_sample,
      function(s) { stringr::str_detect(data_strings, s) }
    )
  
  # coalesce indices into a main index
  match_idx <- rowSums(as.data.frame(match_indices))
  
  # keep only the strings that matched none of the substrings
  data_strings[!match_idx]
}

Test:

z <- 
  microbenchmark::microbenchmark(
    "n=10" = filter_with_loop(10),
    "n=100" = filter_with_loop(100),
    "n=1000" = filter_with_loop(1000),
    "n=10000" = filter_with_loop(10000),
    times = 10
  )
#> Warning in microbenchmark::microbenchmark(`n=10` = filter_with_loop(10), : less
#> accurate nanosecond times to avoid potential integer overflows

# milliseconds
summary(z, "ms")[c("expr", "mean", "median")]
#>      expr         mean       median
#> 1    n=10    1.4567997    0.1490760
#> 2   n=100    0.8187249    0.7993565
#> 3  n=1000   29.2275593   29.5053015
#> 4 n=10000 2852.6977663 2843.7028870

# relative time
summary(z, "relative")[c("expr", "mean", "median")]
#>      expr         mean       median
#> 1    n=10    1.0000000     1.000000
#> 2   n=100    0.5620024     5.362074
#> 3  n=1000   20.0628537   197.921205
#> 4 n=10000 1958.1949161 19075.524477

Created on 2023-05-05 with reprex v2.0.2

Upvotes: 0

Brian

Reputation: 8275

For this specific use case, where the substrings follow a well-behaved pattern, you can extract the target substring from the longer names and then do exact equality matching against the sample.

Setup:

# create a master population to choose from
uuid_list <- uuid::UUIDgenerate(n = 1e5)

filter_from_string_extract <- function(n) {
  
  # choose a representative population
  uuid_population <- sample(uuid_list, n)
  
  # generate the longer strings
  data_strings <- paste0(
    "Here_are_some_strings_with_associated_UUIDs_",
    uuid_population,
    "_and_additional_metadata.json"
  )
  
  # pre-extract just the UUID into a new vector
  data_uuids <- 
    stringr::str_extract(
      data_strings,
      "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
    )
  
  # generate a small sample to check against
  uuid_sample <- sample(uuid_population, n/10)
  
  # filter on exact match of the UUID instead of regex pattern
  data_strings[!(data_uuids %in% uuid_sample)]
}

Test:

y <- 
  microbenchmark::microbenchmark(
    "n=10" = filter_from_string_extract(10),
    "n=100" = filter_from_string_extract(100),
    "n=1000" = filter_from_string_extract(1000),
    "n=10000" = filter_from_string_extract(10000),
    "n=100000" = filter_from_string_extract(100000),
    times = 10
  )
#> Warning in microbenchmark::microbenchmark(`n=10` =
#> filter_from_string_extract(10), : less accurate nanosecond times to avoid
#> potential integer overflows

# milliseconds
summary(y, "ms")[c("expr", "mean", "median")]
#>       expr        mean      median
#> 1     n=10   0.0994824   0.0759730
#> 2    n=100   0.2499114   0.2374515
#> 3   n=1000   1.8917400   1.8155005
#> 4  n=10000  18.4703319  17.4055865
#> 5 n=100000 204.7527167 199.9893490

# relative time
summary(y, "relative")[c("expr", "mean", "median")]
#>       expr        mean      median
#> 1     n=10    1.000000    1.000000
#> 2    n=100    2.512117    3.125472
#> 3   n=1000   19.015826   23.896654
#> 4  n=10000  185.664318  229.102267
#> 5 n=100000 2058.180308 2632.373988

Created on 2023-05-03 with reprex v2.0.2

Even going up another order of magnitude, there's acceptable performance, and it appears to scale linearly with n.

Upvotes: 1
