Reputation: 3752
I have a big data table in which I want to check whether a file such as 103a_foo
is present. However, the filenames in the table are written in different ways, so I have to use a regex.
dt = structure(list(myID = c("86577", "34005", "34005",
                             "194000", "30252", "71067"),
                    filename = c("/scratch/tmpdir/12a_foo.mzXML.gz",
                                 "/scratch/tmpdir/103b_foo.XML.gz",
                                 "/scratch/tmpdir/103a_foo.XML.gz",
                                 "/scratch/tmpdir/103a_foo.XML.gz",
                                 "/scratch/tmpdir/100b_foo.XML.gz",
                                 "/scratch/tmpdir/108a_foo.XML.gz")),
               class = c("data.table", "data.frame"),
               row.names = c(NA, -6L),
               .Names = c("myID", "filename"))
As an output, I want the index 3, as this is the first row where it occurs. I would have used grep('103a_foo', dt$filename)[1]
, but I want the search to stop at the first occurrence, because the table is large (10 million rows).
Upvotes: 1
Views: 462
Reputation: 24480
As @Roland pointed out, with grep
you won't be able to stop at the first match. However, if you need to perform the operation you described often, it could be helpful to extract the "base names" once and for all and then use match
(which actually stops at the first occurrence). Something like:
#this line might not work depending on the actual format of your real data
basenames <- gsub("^.*/|\\..*$", "", dt$filename)
#then we use match
match("103a_foo", basenames)
#[1] 3
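The one-off extraction pays for itself when several lookups reuse the same basenames vector, since match is vectorized over its first argument. A small sketch on the filenames from the question ("no_such_file" is a made-up name, just to show the NA case):

```r
filenames <- c("/scratch/tmpdir/12a_foo.mzXML.gz",
               "/scratch/tmpdir/103b_foo.XML.gz",
               "/scratch/tmpdir/103a_foo.XML.gz",
               "/scratch/tmpdir/103a_foo.XML.gz",
               "/scratch/tmpdir/100b_foo.XML.gz",
               "/scratch/tmpdir/108a_foo.XML.gz")

# strip the directory and the extensions once
basenames <- gsub("^.*/|\\..*$", "", filenames)

# match() is vectorized over its first argument, so all
# lookups share the single precomputed basenames vector
match(c("103a_foo", "100b_foo", "no_such_file"), basenames)
# [1]  3  5 NA
```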
Upvotes: 2
Reputation: 132706
If you set fixed = TRUE
it doesn't take that long. Is it really too slow for your needs?
x <- sample(dt$filename, 1e7, TRUE)
library(microbenchmark)
microbenchmark(grep('103a_foo', x),
               grep('103a_foo', x, fixed = TRUE),
               times = 5)
#Unit: milliseconds
#                              expr       min       lq     mean   median       uq      max neval cld
#               grep("103a_foo", x) 2124.8178 2125.707 2128.7849 2127.542 2128.2054 2137.6532     5   b
# grep("103a_foo", x, fixed = TRUE)  826.2298  826.597  832.7058  829.969  840.1974  840.5359     5
To my knowledge, there is no efficient way to implement a grep
that breaks out of the vectorized loop in pure R. You could use Rcpp if you need this often.
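As a pure-R compromise, base R's Position() does return as soon as its predicate is TRUE, though it calls the predicate once per element, so the R-level loop is usually slower than a vectorized grep unless the match occurs early. A sketch on the sample filenames from the question:

```r
filenames <- c("/scratch/tmpdir/12a_foo.mzXML.gz",
               "/scratch/tmpdir/103b_foo.XML.gz",
               "/scratch/tmpdir/103a_foo.XML.gz",
               "/scratch/tmpdir/103a_foo.XML.gz",
               "/scratch/tmpdir/100b_foo.XML.gz",
               "/scratch/tmpdir/108a_foo.XML.gz")

# Position() walks the vector and stops at the first element
# for which the predicate is TRUE -- an interpreted loop, so
# each grepl() call carries per-element overhead
Position(function(s) grepl("103a_foo", s, fixed = TRUE), filenames)
# [1] 3
```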
Upvotes: 6