Alex
Alex

Reputation: 1304

Filter strings across multiple columns with data.table

I have a dataset that looks something like this.

df <- tibble::tribble(
  ~name,           ~x,  ~y,              ~z,  
  "N/A",            1,   "testSmith",    -100, 
  "N A",            3,   "NOt available", -99,
  "test Smith",     NA,  "test Smith",    -98,
  "Not Available", -99, "25",             -101,
  "test Smith",    -98, "28",             -1)

I would like to create a new data.table that keeps all the rows the string "test".

The final dataset should look something like this

  name           x y              z
  <chr>      <dbl> <chr>      <dbl>
1 N/A            1 testSmith   -100
2 test Smith    NA test Smith   -98
3 test Smith   -98 28            -1

I could do this column by column like this

setDT(df)[name%like%"test"|y%like%"test"]

The problem with this approach is that I have hundreds of string variables and I would like to find a more compact approach. I tried the followings but they do not work

chvar <- keep(trai,is.character)%>%names()
setDT(df)[chvar%like%"test"]#error
setDT(df)[(chvar)%like%"test"]#error
setDT(df)[.(chvar)%like%"test"]#empty dt

Does someone know how I could do it in a quick and efficient way?

Thanks a lot for your help

Upvotes: 3

Views: 1017

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 102710

Another data.table option is using grepl

dt[rowSums(dt[, lapply(.SD, grepl, pattern = "test")]) > 0, .SDcols = .(name, y)]

such that

         name   x          y    z
1:        N/A   1  testSmith -100
2: test Smith  NA test Smith  -98
3: test Smith -98         28   -1

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389275

In data.table you can do :

library(data.table)

cols <- c('name', 'y')
setDT(df)

df[df[, Reduce(`|`, lapply(.SD, `%like%`, "test")), .SDcols = cols]]

#         name   x          y    z
#1:        N/A   1  testSmith -100
#2: test Smith  NA test Smith  -98
#3: test Smith -98         28   -1

In base R :

subset(df, Reduce(`|`, lapply(df[cols], function(x) grepl('test', x))))

dplyr :

library(dplyr)
df %>% filter(Reduce(`|`, across(all_of(cols), ~grepl('test', .x))))

lapply/across returns a list of TRUE/FALSE values for all columns. It will return TRUE if 'test' is present and FALSE if it is not present. When we use it in combination with Reduce and | it will give TRUE only of there is atleast one TRUE value in the row. If all the values in the row are FALSE it will returns FALSE. We select only those rows which has at least one TRUE value in it.

Upvotes: 5

Related Questions