Detecting a pattern in a data.frame column with str_detect is too slow in R

Question

I have a data.frame with 50,000 rows and 194 columns. One of the columns, named "Gene", holds one or more entries that always follow the same pattern, e.g. "gene1" or "gene1;gene2" or "gene1:gene2:gene3". I also have a character vector holding a very long regular expression, e.g. "\bgene1$|\bgene2$|\bgene3$|\bgene4$...", up to a total of 4,000 patterns, i.e. 4,000 \bgene$.

I want to find the matches of that pattern in the Gene column of my data.frame.

Here is an example of the code I am using so far (I cannot post my whole data.frame because it is too large):

genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC") # my real number of genes is 3,000

# then I converted the gene vector to regexp patterns
# (the backslash must be doubled: "\b" alone is a backspace character in R)
genes2 <- paste0("\\b", genes, "\\b")

# then I try the matching
# (tib is my data.frame and Gene the column with the values I want to match)
matches <- unique(grep(paste(genes2, collapse = "|"),
                       tib$Gene, value = TRUE, perl = FALSE))

# And finally filtering the data.frame (%>% and filter() come from dplyr)
library(dplyr)
tib2 <- tib %>% dplyr::filter(Gene %in% matches)

However, with my real data, grep (with perl = FALSE) fails with an out-of-memory error, so I tried the stringr library instead, but it is too slow to complete the search:

test <- str_extract_all(tib$Gene.refGene, paste(genes2, collapse = "|"))
test2 <- str_detect(tib$Gene.refGene, paste(genes2, collapse = "|"))

Both test and test2 are too slow.

Any hints on how to speed this up?

An example with fewer rows, courtesy of @jay.sf, would look like this:

d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94", 
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92", 
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77", 
"AARS1:XY11:XY17:XY62:XY75", "XY25:PEGX:XY47:XY6:XY76:XY84", 
"APC:XY31:XY36:XY48:XY51:XY65", "BRCA1"), x = c(-1.04042150945666, 
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065, 
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073, 
-0.475954635098868)), class = "data.frame", row.names = c(NA, 
-10L))

And the list of genes is fixed: genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC"). I want to find exact matches between members of the gene list and the genes in the Gene column, i.e. BRCA1 (in the gene list) should match only BRCA1, not BRCA11, in the Gene column of the data.frame.

But bear in mind that my real gene list has 4,000 genes and my data.frame has 50,000 rows.

jay.sf · Accepted Answer

I'm not sure about your input and output. But assuming data like this,

d
#                                     gene          x
# 1                                   XY42 -1.0404215
# 2   SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 3                XY17:XY23:XY35:XY36:XY8 -0.2677627
# 4         MUTYH:XY43:XY62:XY85:XY91:XY92  0.7581688
# 5  AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 6                XY2:XY22:XY28:XY69:XY77  0.4681580
# 7              AARS1:XY11:XY17:XY62:XY75  0.1269098
# 8           XY25:XY46:XY47:XY6:XY76:XY84 -0.8524057
# 9          XY22:XY31:XY36:XY48:XY51:XY65 -0.9170595
# 10                                  XY36 -0.4759546

you could first split the genes at ":" using strsplit and match them against your genes vector.

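## all genes from d (fixed = TRUE splits on a literal ":"; "\:" is not a valid R escape)
d.genes.0 <- sort(unique(unlist(strsplit(d$gene, ":", fixed = TRUE))))
## positions in d.genes.0 of the entries of `genes` that occur in d, as numbers
d.genes.1 <- as.numeric(na.omit(match(genes, d.genes.0)))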

Second, we convert the split genes (as above) to factors, using d.genes.0 as the factor levels. By exploiting the numerical conversion of factors, we finally match on numbers rather than on strings.

rw <- sapply(strsplit(d$gene, ":", fixed = TRUE), function(x) 
  any(d.genes.1 %in% as.numeric(factor(x, levels = d.genes.0))))
d[rw, ]
#                                    gene          x
# 2  SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 4        MUTYH:XY43:XY62:XY85:XY91:XY92  0.7581688
# 5 AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 7             AARS1:XY11:XY17:XY62:XY75  0.1269098
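For comparison, here is a minimal sketch of a simpler variant that skips the factor conversion and tests the split symbols against `genes` directly with `%in%`. This is not the method above, just an alternative with the same exact-match semantics; the toy data frame and its values are made up for illustration, and I have not benchmarked it against the factor approach.

```r
# Toy data mirroring the shape of `d` (values are illustrative only).
d <- data.frame(
  gene = c("XY42", "SAMD11:XY20", "AARS1:XY11", "BRCA11:XY3", "BRCA1"),
  x = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC")

# Split each row into its individual gene symbols.
# fixed = TRUE treats ":" as a literal separator, so no regex is involved.
tokens <- strsplit(d$gene, ":", fixed = TRUE)

# Keep a row when at least one of its symbols is exactly an element of `genes`.
# Whole symbols are compared, so "BRCA1" does not match "BRCA11".
keep <- vapply(tokens, function(x) any(x %in% genes), logical(1))

d[keep, ]
```

Because whole symbols are compared, no word-boundary regex is needed at all, which sidesteps the huge alternation pattern from the question.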

Tested with > 4k genes and 50k rows, should work.
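If you want to check the scaling yourself, something like the following sketch could be used. It builds synthetic data of roughly the size in the question (the symbol names `G1`, `G2`, ... and the 1 to 7 symbols per row are assumptions, not your real data) and times a vectorised split-and-`%in%` lookup, which is a variant of the idea above rather than the exact code.

```r
set.seed(1)  # reproducible synthetic data; all names below are made up

# Hypothetical pool of 20,000 symbols; the first 4,000 play the role of `genes`.
pool  <- paste0("G", seq_len(20000))
genes <- pool[1:4000]

# 50,000 rows, each holding 1 to 7 colon-separated symbols.
n_rows <- 50000
sizes  <- sample(1:7, n_rows, replace = TRUE)
picks  <- sample(pool, sum(sizes), replace = TRUE)
row_id <- rep(seq_len(n_rows), sizes)
gene_col <- vapply(split(picks, row_id), paste, character(1), collapse = ":")
d <- data.frame(gene = unname(gene_col), stringsAsFactors = FALSE)

timing <- system.time({
  # Split once, then test all symbols in a single vectorised %in% call.
  tokens <- strsplit(d$gene, ":", fixed = TRUE)
  ids    <- rep(seq_along(tokens), lengths(tokens))  # row index of each symbol
  keep   <- seq_len(nrow(d)) %in% ids[unlist(tokens) %in% genes]
})

print(timing)   # elapsed time depends on the machine
sum(keep)       # number of rows containing at least one listed gene
```

`d[keep, ]` then gives the filtered data frame.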

Data:

d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94", 
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92", 
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77", 
"AARS1:XY11:XY17:XY62:XY75", "XY25:XY46:XY47:XY6:XY76:XY84", 
"XY22:XY31:XY36:XY48:XY51:XY65", "XY36"), x = c(-1.04042150945666, 
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065, 
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073, 
-0.475954635098868)), class = "data.frame", row.names = c(NA, 
-10L))
