Austin Overman

Reputation: 163

Using grep with a very large vector of regular expressions in R

I have a rather large vector (table) with 325k unique observations that I want to use as a list of regular expressions to find matches in another vector (data) of 26k observations.

I am using the code below, which works well if table, and therefore the resulting list of regular expressions, has fewer than 3000 entries (although my guess is that it is the total character count that matters, rather than the number of expressions as a whole):

matches <- unique(grep(paste(table, collapse="|"), 
                       data$ID,
                       perl = TRUE,
                       value=FALSE))

But if 'table', and therefore the resulting list of regular expressions, is any longer than this, I get the error:

PCRE pattern compilation error - 'regular expression is too large'

The observations that I want to search have a mixed bag of character string patterns such as "xxx-yyyy", "L-cc-fff-C12Z3N-xxx", and even "Name.xxx-12N7t-p6". Because of this, it is not realistic to parse out the portion of each string that may match one of the regular expressions in my 325k vector and then use match(), hence my desire to use regular expressions.

What would be the best approach short of breaking my 'table' into 3000+ subsets and using the above code?

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
In R-Studio Version 0.98.1028

Thanks for your assistance.

Upvotes: 3

Views: 2010

Answers (1)

dww

Reputation: 31452

You can check each regex string one at a time in an lapply loop. This will be a little slow, but if speed is not important it will provide a satisfactory solution:

matches <- unique(unlist(lapply(mytable, grep, x = mydata$id, value = FALSE)))
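
A possible middle ground between one huge alternation and one grep call per pattern is to combine the patterns in moderately sized chunks. The sketch below assumes a chunk of patterns joined with "|" stays under the PCRE size limit. The helper name grep_in_chunks and the default chunk_size of 1000 are made up for illustration and would need tuning to the real pattern lengths:

```r
# Hypothetical helper: grep many patterns against x, a chunk at a time.
# chunk_size is an assumed value; lower it if PCRE still reports
# 'regular expression is too large'.
grep_in_chunks <- function(patterns, x, chunk_size = 1000) {
  # Assign each pattern to a chunk of at most chunk_size patterns
  chunk_id <- ceiling(seq_along(patterns) / chunk_size)
  chunks <- split(patterns, chunk_id)

  # One grep call per chunk, with the chunk's patterns joined by "|"
  hits <- lapply(chunks, function(p) {
    grep(paste(p, collapse = "|"), x, perl = TRUE)
  })

  # Indices of x matched by at least one pattern
  sort(unique(unlist(hits, use.names = FALSE)))
}

# Tiny made-up example
patterns <- c("^ab", "c-1", "zz$", "x\\.y")
x <- c("abc", "c-12", "buzz", "x.y", "nope", "xay")
chunk_matches <- grep_in_chunks(patterns, x, chunk_size = 2)
```

This makes far fewer grep calls than the one-pattern-at-a-time loop while keeping each compiled expression small.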

Some reproducible data to test this on:

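set.seed(1)  # fix the random seed so the sample data are reproducible

# 30000 random four-letter IDs to search in
mydata <- data.frame(id = paste0(sample(letters, 30000, TRUE),
                                 sample(letters, 30000, TRUE),
                                 sample(letters, 30000, TRUE),
                                 sample(letters, 30000, TRUE)))

# 30000 random four-letter strings to use as patterns
mytable <- paste0(sample(letters, 30000, TRUE),
                  sample(letters, 30000, TRUE),
                  sample(letters, 30000, TRUE),
                  sample(letters, 30000, TRUE))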
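
If the entries are literal substrings rather than real regular expressions (an assumption, since the question does not say), passing fixed = TRUE to grep skips regex compilation and treats characters such as "." literally. A minimal sketch with made-up values, where literal_patterns and ids are illustrative names:

```r
# Made-up literal substrings and IDs, for illustration only
literal_patterns <- c("a.b", "12N7t")
ids <- c("a.b-1", "axb-2", "Name.xxx-12N7t-p6", "other")

# fixed = TRUE matches each pattern as a plain string, so "a.b"
# does not match "axb-2" the way the regex a.b would
hits <- lapply(literal_patterns, function(p) grep(p, ids, fixed = TRUE))
literal_matches <- sort(unique(unlist(hits)))
```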

By the way, data and table are both names of existing functions in R (they are not reserved words, but they mask utils::data and base::table), so it is not great practice to use them as variable names. I therefore called them mytable and mydata instead.

Upvotes: 0
