Find first match of a substring in a column of big data.table

Question

I have a big data.table in which I want to check whether 103a_foo is present. However, the filenames in the table are written differently, so I have to use a regex.

dt = structure(list(myID = c("86577", "34005","34005", 
"194000", "30252", "71067"), 
filename = c("/scratch/tmpdir/12a_foo.mzXML.gz", 
"/scratch/tmpdir/103b_foo.XML.gz", "/scratch/tmpdir/103a_foo.XML.gz",
 "/scratch/tmpdir/103a_foo.XML.gz", 
"/scratch/tmpdir/100b_foo.XML.gz", "/scratch/tmpdir/108a_foo.XML.gz")),
 class = c("data.table", "data.frame"), 
row.names = c(NA, -6L), 
.Names = c("myID", "filename"))

As output, I want the index 3, as this is the first row where it occurs. I would have used grep('103a_foo', dt$filename)[1], but I want the search to stop at the first occurrence, as the table is large (10 million rows).

Roland · Accepted Answer

If you set fixed = TRUE it doesn't take that long. Is it really too slow for your needs?

x <- sample(dt$filename, 1e7, TRUE)
library(microbenchmark)
microbenchmark(grep('103a_foo', x),
               grep('103a_foo', x, fixed = TRUE), 
               times = 5)

#Unit: milliseconds
#                              expr       min       lq      mean   median        uq       max neval cld
#               grep("103a_foo", x) 2124.8178 2125.707 2128.7849 2127.542 2128.2054 2137.6532     5   b
# grep("103a_foo", x, fixed = TRUE)  826.2298  826.597  832.7058  829.969  840.1974  840.5359     5

To my knowledge, there is no efficient way to implement a grep that breaks out of the vectorization loop using pure R. You could use Rcpp, if you need this often.
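The Rcpp route could look roughly like the sketch below. It is only an illustration, not a tuned implementation: the function name first_fixed_match is made up here, it assumes the Rcpp package and a working C++ compiler are available, and it does a fixed (non-regex) substring search only, skipping NA entries.

```r
library(Rcpp)

# Hypothetical helper: returns the 1-based index of the first element of x
# that contains `pattern` as a literal substring, or NA if there is none.
# The loop returns as soon as a match is found instead of scanning all of x.
cppFunction('
int first_fixed_match(CharacterVector x, std::string pattern) {
  R_xlen_t n = x.size();
  for (R_xlen_t i = 0; i < n; ++i) {
    if (CharacterVector::is_na(x[i])) continue;   // skip missing values
    std::string s = as<std::string>(x[i]);
    if (s.find(pattern) != std::string::npos) {
      return static_cast<int>(i + 1);             // R indices start at 1
    }
  }
  return NA_INTEGER;                              // no match found
}')

# Small stand-in for dt$filename from the question
files <- c("/scratch/tmpdir/12a_foo.mzXML.gz",
           "/scratch/tmpdir/103b_foo.XML.gz",
           "/scratch/tmpdir/103a_foo.XML.gz",
           "/scratch/tmpdir/103a_foo.XML.gz",
           "/scratch/tmpdir/100b_foo.XML.gz",
           "/scratch/tmpdir/108a_foo.XML.gz")

first_fixed_match(files, "103a_foo")
# 3
```

Whether this beats grep(..., fixed = TRUE) depends on where the first match sits: if it is near the end of the vector, or absent, the loop still visits every element.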
