jeanlain

Reputation: 418

R finding string patterns of a vector in elements of another (by pairs)

I have a character vector

c1 <- c("BEL","BEL","BEL","BEL")

and another character vector of the same length

c2 <- c(" BEL-65_DRe-I_1p:BEL;_LTR_Retrotransposon;_Transposable_Element;_Nonautonomous;_BEL-65_DRe-I", "L1-2_NN_3p:L1;_Non-LTR_Retrotransposon;_Transposable_Element;_L1-2_NN", "BEL-13_CQ-I_1p:BEL;_LTR_Retrotransposon;_Transposable_Element;_BEL-13_CQ_;_BEL-13_CQ-LTR;_BEL-13_CQ-I", "BEL-31_CQ-I_1p:BEL;_LTR_Retrotransposon;_Transposable_Element;_BEL-31_CQ_;_BEL-31_CQ-LTR;_BEL-31_CQ-I", "Gypsy-22_CQ-I_1p:Gypsy;_LTR_Retrotransposon;_Transposable_Element;_Gypsy-22_CQ_;_Gypsy-22_CQ-LTR;_Gypsy-22_CQ-I")

I want to know whether each string in c1 is found in c2 at the same index (ignoring case), i.e. whether c1[1] is found in c2[1], c1[2] in c2[2], and so on. In practice, the vectors can have millions of elements.

My current solution is

test <- Map(function(x, y) grepl(x, y, ignore.case = TRUE), c1, c2)

But it's not vectorised, hence relatively slow. Is there a better solution?

Upvotes: 1

Views: 1480

Answers (3)

User2321

Reputation: 3062

You could try the following using the stringr package:

library(stringr)
library(data.table)

# Pair the two vectors row by row, then compare them with case removed on both sides
data <- data.table(c1, c2)
data[, FOUND := str_detect(toupper(c2), toupper(c1))]
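One caveat worth keeping in mind: str_detect() treats c1 as a regular expression, so upper-casing the pattern can change its meaning if it contains escapes. For plain names like "BEL" this does not matter. A minimal base R sketch of the effect, using a made-up pattern (`\\d` is not from the question):

```r
# Hypothetical pattern, chosen only to illustrate the caveat.
pattern <- "\\d"                  # regex escape: matches any digit

# toupper() turns \d (digit) into \D (non-digit).
upper_pattern <- toupper(pattern)

digit_found <- grepl(pattern, "1")                 # looks for a digit in "1"
upper_found <- grepl(upper_pattern, toupper("1"))  # looks for a non-digit in "1"

print(c(digit_found, upper_found))
```

If the patterns might contain such escapes, matching them as fixed strings or using a case-insensitive option avoids the problem.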

Upvotes: 3

lukeA

Reputation: 54277

This runs quite fast:

library(stringi)
c1 <- stri_rand_strings(1e6, 2)
c2 <- paste0(stri_rand_strings(1e6, 20), tolower(c1))
system.time(res <- stri_detect(c2, fixed = c1, case_insensitive = TRUE))
       # user      system     elapsed 
       # 0.73        0.00        0.75

This is partly because I matched a constant string (fixed) rather than a regular expression pattern, an option you could also use with grep*.
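For reference, a rough base R sketch of the same fixed-string idea. grepl() is not vectorised over its pattern, so this still loops through mapply(), and since ignore.case is ignored when fixed = TRUE, both sides are lower-cased first. The small vectors below are made up for illustration, not taken from the benchmark above.

```r
# Illustrative data (assumed, not from the question).
c1 <- c("BEL", "bel", "L1")
c2 <- c("BEL-65_DRe-I", "Gypsy-22_CQ-I", "l1-2_NN_3p")

# Pairwise fixed-string match; case is removed on both sides beforehand
# because fixed = TRUE does not honour ignore.case.
res <- mapply(grepl, tolower(c1), tolower(c2),
              MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(res)
```

On millions of elements I would still expect the stringi call to be faster, since it avoids the R-level loop, but I have not timed this variant.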

Upvotes: 4

Benjamin Mohn

Reputation: 301

Another option, similar to your solution, is to use apply. It works well on this small example; whether it will be faster on bigger data, I do not know.

apply(rbind(c1, c2), 2, function(y) grepl(pattern = y[1], x = y[2], ignore.case = TRUE))
[1]  TRUE FALSE  TRUE  TRUE FALSE 

Edit: I had to add one more "BEL" to make it work, because your c1 has 4 elements and c2 has 5.
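Since a length mismatch like this is easy to miss (pairwise functions may recycle the shorter vector, at best with a warning), a small wrapper that checks the lengths first might help. This is only a sketch; pairwise_grepl is a name made up here, not an existing function, and the data are illustrative.

```r
# Hypothetical helper: refuses to run on vectors of different lengths.
pairwise_grepl <- function(patterns, strings) {
  stopifnot(length(patterns) == length(strings))
  # vapply guarantees a logical vector with one result per pair
  vapply(seq_along(patterns),
         function(i) grepl(patterns[i], strings[i], ignore.case = TRUE),
         logical(1))
}

c1 <- c("BEL", "BEL", "BEL")
c2 <- c("BEL-65_DRe-I", "L1-2_NN_3p", "bel-13_CQ-I")
res <- pairwise_grepl(c1, c2)
print(res)
```

It is still an R-level loop, so it is unlikely to be faster than Map or apply; the only gain is the explicit length check.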

Upvotes: 1
