Reputation: 1446
I have a data.frame with 50,000 rows and 194 columns. One of the columns, named "Gene", holds one or more entries that always follow the same pattern, e.g. "gene1", "gene1:gene2" or "gene1:gene2:gene3". I also have a character vector holding a very long regular expression, e.g. "\bgene1\b|\bgene2\b|\bgene3\b|\bgene4\b...", with 4,000 alternatives in total, i.e. 4,000 \bgene\b patterns.
I want to find the matches of that pattern in the column Gene of my data.frame.
Here is an example of the code I am using so far (I cannot post my whole data.frame because it is too large):
library(dplyr)
genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC") # my real gene list has 4,000 entries
# convert the gene vector to word-bounded regular expressions
genes2 <- paste0("\\b", genes, "\\b")
# try the matching; tib is my data.frame and Gene is the column with the values I want to match
matches <- unique(grep(paste(genes2, collapse = "|"),
                       tib$Gene, value = TRUE, perl = FALSE))
# finally, filter the data.frame
tib2 <- tib %>% dplyr::filter(Gene %in% matches)
However, when I use my real data, I get an out-of-memory error with grep (setting perl = FALSE), so I tried the stringr library instead, but it is too slow to complete the search:
library(stringr)
test <- str_extract_all(tib$Gene.refGene, paste(genes2, collapse = "|"))
test2 <- str_detect(tib$Gene.refGene, paste(genes2, collapse = "|"))
Both test and test2 are too slow. Any hints on how to speed this up?
An example with fewer rows, courtesy of @jay.sf, would look like this:
d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94",
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92",
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77",
"AARS1:XY11:XY17:XY62:XY75", "XY25:PEGX:XY47:XY6:XY76:XY84",
"APC:XY31:XY36:XY48:XY51:XY65", "BRCA1"), x = c(-1.04042150945666,
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065,
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073,
-0.475954635098868)), class = "data.frame", row.names = c(NA,
-10L))
And the list of genes is fixed: genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC"). I want to find exact matches between members of the gene list and the genes in the Gene column, i.e., BRCA1 (in the gene list) should match only BRCA1, not BRCA11, in the Gene column of the data.frame.
But bear in mind that my real gene list has 4,000 genes and my data.frame has 50,000 rows.
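To make the exact-match requirement concrete, here is a minimal sketch of how the \\b word boundaries used above behave. The vector `x` is made up for illustration and is not taken from the real data.

```r
# hypothetical test values, not taken from the real data
x <- c("BRCA1", "BRCA11", "XY1:BRCA1:XY2", "XY1:BRCA11")
# \\b marks a word boundary, so "BRCA1" cannot match inside "BRCA11"
hits <- grepl("\\bBRCA1\\b", x)
hits
# [1]  TRUE FALSE  TRUE FALSE
```

So the pattern itself gives the intended result; the problem is only the cost of evaluating thousands of such alternatives against every row.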
Reputation: 72693
I'm not sure about your input and output. But assuming data like this,
d
# gene x
# 1 XY42 -1.0404215
# 2 SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 3 XY17:XY23:XY35:XY36:XY8 -0.2677627
# 4 MUTYH:XY43:XY62:XY85:XY91:XY92 0.7581688
# 5 AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 6 XY2:XY22:XY28:XY69:XY77 0.4681580
# 7 AARS1:XY11:XY17:XY62:XY75 0.1269098
# 8 XY25:XY46:XY47:XY6:XY76:XY84 -0.8524057
# 9 XY22:XY31:XY36:XY48:XY51:XY65 -0.9170595
# 10 XY36 -0.4759546
you could split the genes at ":" using strsplit and, first, match them against your genes vector.
## all distinct genes occurring in d, sorted
d.genes.0 <- sort(unique(unlist(strsplit(d$gene, "\\:"))))
## positions in d.genes.0 of the genes that also occur in the `genes` vector
d.genes.1 <- as.numeric(na.omit(match(genes, d.genes.0)))
Second, we convert the split genes (as above) to factors, using d.genes.0 as the factor levels. Exploiting the numeric conversion of factors, we finally match on numbers rather than on strings.
rw <- sapply(strsplit(d$gene, "\\:"), function(x)
any(d.genes.1 %in% as.numeric(factor(x, levels=d.genes.0))))
d[rw, ]
# gene x
# 2 SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 4 MUTYH:XY43:XY62:XY85:XY91:XY92 0.7581688
# 5 AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 7 AARS1:XY11:XY17:XY62:XY75 0.1269098
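As a small illustration of the factor trick used above, here is a sketch with a made-up level vector `lev` standing in for d.genes.0. The input strings are hypothetical as well.

```r
# `lev` plays the role of d.genes.0: a sorted vector of known gene names
lev <- c("AARS1", "MUTYH", "XY42")
# each string becomes the index of its level; strings not in `lev` become NA
codes <- as.numeric(factor(c("XY42", "AARS1", "FOO"), levels = lev))
codes
# [1]  3  1 NA
```

Because every gene is reduced to its position in the level vector, the later `%in%` comparison works on numbers only.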
Tested with > 4k genes and 50k rows, should work.
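A more direct variant of the same split-then-match idea might also be worth trying. This is only a sketch and was not part of the test mentioned above: it skips the factor conversion and checks the split strings against `genes` with `%in%`. The vector `gene_col` is a hypothetical stand-in for d$gene.

```r
genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC")
# small hypothetical stand-in for d$gene
gene_col <- c("XY42", "SAMD11:XY20", "XY17:XY23", "AARS11:XY5", "BRCA1")
# split each entry at ":" (fixed = TRUE treats ":" literally, not as a regex)
parts <- strsplit(gene_col, ":", fixed = TRUE)
# keep an entry if any of its genes is exactly one of `genes`
keep <- vapply(parts, function(x) any(x %in% genes), logical(1))
gene_col[keep]
# [1] "SAMD11:XY20" "BRCA1"
```

With the real data, `d[keep, ]` would then select the matching rows, assuming `keep` was computed from d$gene. Since `%in%` compares whole strings, "AARS11" does not match "AARS1".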
Data:
d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94",
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92",
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77",
"AARS1:XY11:XY17:XY62:XY75", "XY25:XY46:XY47:XY6:XY76:XY84",
"XY22:XY31:XY36:XY48:XY51:XY65", "XY36"), x = c(-1.04042150945666,
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065,
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073,
-0.475954635098868)), class = "data.frame", row.names = c(NA,
-10L))