Filtering data.table rows by the presence of a column in a strsplit of another column

Question

I have a data table:

dt <- data.table(col1=c('aa,bb', 'bb,cc,ee', 'dd,ee'), col2=c('aa', 'cc', 'aa'))
> dt
    col1      col2
1: aa,bb      aa
2: bb,cc,ee   cc
3: dd,ee      aa

I want to check if column 2 occurs in the strsplit of column one, so for the first row if aa is present in aa,bb split by a comma, which is true. It's also true for the second row, and false for the third. I only want to keep the rows where this occurs, so only row 1 and 2.

My first thought was doing it like this:

dt[col2 %in% strsplit(col1, ',')]

However, that returns an empty data.table.

I can think of multiple solutions to solve this, including making new columns using tstrsplit, or melting the data table, but all of these are a bit tedious for such a seemingly simple task. Any suggestions?

akrun · Accepted Answer

We can use str_detect from stringr

library(stringr)
dt[, flag := str_detect(col1, col2)]
dt
#       col1 col2  flag
#1:    aa,bb   aa  TRUE
#2: bb,cc,ee   cc  TRUE
#3:    dd,ee   aa FALSE

Also, to avoid any substring matches, we can specify the word boundary (\b)

dt[, str_detect(col1, str_c("\b", col2, "\b"))]
#[1]  TRUE  TRUE FALSE

Regarding the use of strsplit, the output would be a list of vectors. So, we need to use a function that checks the values of 'col1' are in the corresponding elements of list. Map does that

dt[,  unlist(Map(`%in%`, col2, strsplit(col1, ",")))]

To apply the filter in the same step and return the 2 row data.table:

dt[unlist(Map(`%in%`, col2, strsplit(col1, ",")))]

Filtering data.table rows by the presence of a column in a strsplit of another column

Answers (1)

Related Questions