Chris

Reputation: 6362

Test two columns of strings for match row-wise in R

Let's say I have two columns of strings:

library(data.table)
DT <- data.table(x = c("a","aa","bb"), y = c("b","a","bbb"))

For each row, I want to know whether the string in x occurs within the string in y on the same row. A looping approach would be:

# seq_len(nrow(DT)) is safe even when DT has zero rows
for (i in seq_len(nrow(DT))) {
  DT$test[i] <- DT[i, grepl(x, y) + 0]
}

DT
    x   y test
1:  a   b    0
2: aa   a    0
3: bb bbb    1

Is there a vectorized implementation of this? Calling grep(DT$x, DT$y) uses only the first element of x as the pattern and ignores the rest.

Upvotes: 4

Views: 3707

Answers (5)

Chris

Reputation: 6362

Thank you all for your responses. I've benchmarked them all, and come up with the following:

library(data.table)
library(microbenchmark)

DT <- data.table(x = rep(c("a","aa","bb"),1000), y = rep(c("b","a","bbb"),1000))

DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)

microbenchmark(
DT1[, test := grepl(x, y), by = x]
,
DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2]))
,
DT3$test <- mapply(grepl, pattern=DT3$x, x=DT3$y)
,
{vgrepl <- Vectorize(grepl)
DT4[, test := as.integer(vgrepl(x, y))]}
)

Results

Unit: microseconds
                                                                               expr       min        lq       mean     median        uq        max neval
                                             DT1[, `:=`(test, grepl(x, y)), by = x]   758.339   908.106   982.1417   959.6115  1035.446   1883.872   100
                            DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2])) 16840.818 18032.683 18994.0858 18723.7410 19578.060  23730.106   100
                              DT3$test <- mapply(grepl, pattern = DT3$x, x = DT3$y) 14339.632 15068.320 16907.0582 15460.6040 15892.040 117110.286   100
 {     vgrepl <- Vectorize(grepl)     DT4[, `:=`(test, as.integer(vgrepl(x, y)))] } 14282.233 15170.003 16247.6799 15544.4205 16306.560  26648.284   100

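Along with being the most syntactically simple, the data.table solution is also the fastest on this data. Note that the test data contains only three distinct values of x, so by = x makes just three grepl calls, while the other approaches call grepl once per row (3000 times); with many distinct patterns the gap should narrow.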

Upvotes: 2

David Arenburg

Reputation: 92282

You can simply do:

DT[, test := grepl(x, y), by = x]
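To illustrate why grouping helps here, below is a rough base R sketch of what the grouped call does, as I understand it: inside each by = x group, x is a single value, so grepl receives one pattern and the vector of y values for that group. The fourth row and the variable names are only for illustration and are not part of the original data.

```r
# Hypothetical example data: the original three rows plus one extra row
x <- c("a", "aa", "bb", "a")
y <- c("b", "a", "bbb", "cat")

test <- logical(length(x))

# One grepl call per distinct pattern, applied to all rows sharing that pattern
for (p in unique(x)) {
  idx <- x == p
  test[idx] <- grepl(p, y[idx])
}

test
```

This makes one grepl call per distinct value of x rather than one per row, which pays off most when x has few distinct values.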

Upvotes: 8

Rorschach

Reputation: 32426

Or use mapply (Vectorize is really just a wrapper around mapply):

DT$test <- mapply(grepl, pattern=DT$x, x=DT$y)
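As a small check of the wrapper claim, the sketch below runs both forms on plain vectors and compares the results. It uses only base R; USE.NAMES = FALSE is added here (it is not in the answers above) so that the result is not named after the patterns.

```r
x <- c("a", "aa", "bb")
y <- c("b", "a", "bbb")

# mapply calls grepl once per (pattern, x) pair.
# USE.NAMES = FALSE drops the names mapply would otherwise take from `pattern`.
via_mapply <- mapply(grepl, pattern = x, x = y, USE.NAMES = FALSE)

# Vectorize returns a function that calls mapply internally.
vgrepl <- Vectorize(grepl, USE.NAMES = FALSE)
via_vectorize <- vgrepl(x, y)

identical(via_mapply, via_vectorize)
```

Both calls still invoke grepl once per row, which is consistent with their similar timings in the benchmark.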

Upvotes: 2

talat

Reputation: 70256

You can use Vectorize:

vgrepl <- Vectorize(grepl)
DT[, test := as.integer(vgrepl(x, y))]
DT
    x   y test
1:  a   b    0
2: aa   a    0
3: bb bbb    1

Upvotes: 1

Rudisco

Reputation: 11

You can pass grepl to apply so that it runs on each row of your data table, where the first column contains the string to search for and the second column contains the string to search in. Note that apply still loops over the rows internally (after coercing the table to a character matrix), so this is a compact row-wise solution rather than a truly vectorized one.

DT$test <- apply(DT, 1, function(x) as.integer(grepl(x[1], x[2])))
DT
    x   y test
1:  a   b    0
2: aa   a    0
3: bb bbb    1
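For readers wondering what apply does with the rows, here is a minimal sketch of the equivalent explicit steps, using a plain data.frame instead of a data.table to keep it dependency-free. The names df, m, via_apply and via_loop are illustrative only.

```r
df <- data.frame(x = c("a", "aa", "bb"),
                 y = c("b", "a", "bbb"),
                 stringsAsFactors = FALSE)

# The apply call from the answer, on a data.frame
via_apply <- apply(df, 1, function(r) as.integer(grepl(r[1], r[2])))

# Roughly what apply does: coerce to a matrix, then visit each row in turn
m <- as.matrix(df)
via_loop <- integer(nrow(m))
for (i in seq_len(nrow(m))) {
  via_loop[i] <- as.integer(grepl(m[i, 1], m[i, 2]))
}

identical(unname(via_apply), via_loop)
```

Because every row is visited individually, this costs about the same as the explicit loop in the question.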

Upvotes: 1
