JelenaČuklina

Reputation: 3752

Find first match of a substring in a column of big data.table

I have a big data.table in which I want to check whether 103a_foo is present. However, the filenames in the table are written differently (with a directory prefix and file extensions), so I have to use a regex.

dt = structure(list(myID = c("86577", "34005","34005", 
"194000", "30252", "71067"), 
filename = c("/scratch/tmpdir/12a_foo.mzXML.gz", 
"/scratch/tmpdir/103b_foo.XML.gz", "/scratch/tmpdir/103a_foo.XML.gz",
 "/scratch/tmpdir/103a_foo.XML.gz", 
"/scratch/tmpdir/100b_foo.XML.gz", "/scratch/tmpdir/108a_foo.XML.gz")),
 class = c("data.table", "data.frame"), 
row.names = c(NA, -6L), 
.Names = c("myID", "filename"))

As output, I want the index 3, as this is the first row where it occurs. I would have used grep('103a_foo', dt$filename)[1], but I want the search to stop at the first occurrence because the table is large (10 million rows).

Upvotes: 1

Views: 462

Answers (2)

nicola

Reputation: 24480

As @Roland pointed out, with grep you won't be able to stop at the first match. However, if you need to do the operations you described often, it could be helpful to extract, once and for all, the "base names" you are going to look for and then use match (which actually breaks at the first occurrence). Something like:

#this line might not work depending on the actual format of your real data
basenames<-gsub("^.*/|\\..*$","",dt$filename)
#then we use match
match("103a_foo",basenames)
#[1] 3

Upvotes: 2

Roland

Reputation: 132706

If you set fixed = TRUE it doesn't take that long. Is it really too slow for your needs?

x <- sample(dt$filename, 1e7, TRUE)
library(microbenchmark)
microbenchmark(grep('103a_foo', x),
               grep('103a_foo', x, fixed = TRUE), 
               times = 5)

#Unit: milliseconds
#                              expr       min       lq      mean   median        uq       max neval cld
#               grep("103a_foo", x) 2124.8178 2125.707 2128.7849 2127.542 2128.2054 2137.6532     5   b
# grep("103a_foo", x, fixed = TRUE)  826.2298  826.597  832.7058  829.969  840.1974  840.5359     5  

To my knowledge, there is no efficient way to implement a grep that breaks out of the vectorization loop using pure R. You could use Rcpp, if you need this often.
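If Rcpp is not an option, one possible workaround is to run the vectorized grep on blocks of the vector and stop after the first block that contains a hit. This is only a rough sketch, not something benchmarked here: the helper name first_match_chunked and the default chunk_size are made up for illustration, and it only saves time when the match sits near the start of the vector.

```r
# Hypothetical helper (not part of base R or data.table):
# scan x block by block and stop at the first block that contains a match.
first_match_chunked <- function(pattern, x, chunk_size = 100000L) {
  n <- length(x)
  if (n == 0L) return(NA_integer_)
  for (s in seq.int(1L, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1L, n)
    # fixed = TRUE treats the pattern as a literal string, as in the benchmark above
    hit <- grep(pattern, x[s:e], fixed = TRUE)
    if (length(hit) > 0L) {
      # convert the position inside the block to a position in x
      return(as.integer(s + hit[1L] - 1L))
    }
  }
  NA_integer_  # no match anywhere
}

# Same filenames as in the question
filenames <- c("/scratch/tmpdir/12a_foo.mzXML.gz",
               "/scratch/tmpdir/103b_foo.XML.gz",
               "/scratch/tmpdir/103a_foo.XML.gz",
               "/scratch/tmpdir/103a_foo.XML.gz",
               "/scratch/tmpdir/100b_foo.XML.gz",
               "/scratch/tmpdir/108a_foo.XML.gz")
first_hit <- first_match_chunked("103a_foo", filenames, chunk_size = 2L)
first_hit
#[1] 3
```

If the first match is at the very end (or there is none), every block is scanned and the subsetting adds overhead on top of a single grep call, so the block size is a trade-off you would have to tune on your data.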

Upvotes: 6
