LucasMation
LucasMation

Reputation: 2489

Check if string appears in any column (except 1st one) of a data.table

Based on a data.table, DT, with m string columns, how can flag rows in which a string, let's say "BlaBla" appears in at least one column?

What I want is:

DT[, flag_BlaBLa:=(test if "BlaBLa" appears in any column except the first one)]

I do not want to make explicit reference to column names. Rather I need to refer to col 2 to the max col number in DT (perhaps: 2:ncol(DT)). Why not? In reality, I have a list of data.frames, with varying number of columns in each. I need to create the flag, as indicated above, for all the data.frames in the list

Edit1: adding a reproducible example. Edit2: changing some 'BlaBLa' strings to 'BlaBLa+something'. This should also be matched considered a match as they include 'BlaBLa'.

DT starts with cols a,b,c and I want a script to create the flag_BlaBLa. Final result would be:

DT <- data.table(a=c("x","y","z",'w'),
                 b=c('BlaBLa','BLe','Bli','BlaBLaSometing'), 
                 c=c('Bla','BLe','BlaBLa','Blo'), 
                 flag_BlaBLa=c(T,F,T,T)
                 )

For the list, the final result should be:

DT1 <- data.table(a=c("x","y","z",'w'),
                 b=c('BlaBLa','BLe','Bli','BlaBLaSomething'), 
                 c=c('Bla','BLe','BlaBLa','Blo'), 
                 flag_BlaBLa=c(T,F,T,T)
                 )

DT2 <- data.table(a=c("q","j","p"),
                 b=c('BLe','Bli','BlaBLa'), 
                 flag_BlaBLa=c(F,F,T)
                 )
l <- list(DT1,DT2)

Upvotes: 1

Views: 721

Answers (4)

Ronak Shah
Ronak Shah

Reputation: 388982

We could loop over the list , select the columns and check if there is atleast one "BlaBLa" in any row and mark the row TRUE/FALSE.

library(data.table)
lapply(l, function(x) x[, flag_BlaBLa := rowSums(x[,2:ncol(x)] == "BlaBLa") > 0])


l
#[[1]]
#   a      b      c flag_BlaBLa
#1: x BlaBLa    Bla        TRUE
#2: y    BLe    BLe       FALSE
#3: z    Bli BlaBLa        TRUE
#4: w BlaBLa    Blo        TRUE

#[[2]]
#   a      b flag_BlaBLa
#1: q    BLe       FALSE
#2: j    Bli       FALSE
#3: p BlaBLa        TRUE

EDIT

If it's not an exact match and we need to find a pattern of that string we can need to loop over the columns as well (similar to @MichaelChirico) before taking rowSums

lapply(l, function(x) x[, flag_BlaBLa := rowSums(sapply(x[, 2:ncol(x)],
                      grepl, pattern = 'BlaBLa', fixed = TRUE)) > 0])


#[[1]]
#   a               b      c flag_BlaBLa
#1: x          BlaBLa    Bla        TRUE
#2: y             BLe    BLe       FALSE
#3: z             Bli BlaBLa        TRUE
#4: w BlaBLaSomething    Blo        TRUE

#[[2]]
#   a      b flag_BlaBLa
#1: q    BLe       FALSE
#2: j    Bli       FALSE
#3: p BlaBLa        TRUE

Upvotes: 1

akrun
akrun

Reputation: 887118

We can specify the columns of interest in .SDcols, loop through the Subset of data.table (.SD) check if it is equal 'BlaBLa', Reduce it to single logical vector to create the column

library(data.table)
lapply(l, function(x) x[, flag_BlaBLa := Reduce(`|`, lapply(.SD, `==`, 
           "BlaBLa")), .SDcols = 2:ncol(x)][])
#[[1]]
#   a      b      c flag_BlaBLa
#1: x BlaBLa    Bla        TRUE
#2: y    BLe    BLe       FALSE
#3: z    Bli BlaBLa        TRUE
#4: w BlaBLa    Blo        TRUE

#[[2]]
#   a      b flag_BlaBLa
#1: q    BLe       FALSE
#2: j    Bli       FALSE
#3: p BlaBLa        TRUE

Upvotes: 3

MichaelChirico
MichaelChirico

Reputation: 34703

I would use a for loop over the list and sapply over the columns, using .SDcols to exclude the first:

for (ii in seq_along(l)) {
  l[[ii]][ , .SDcols = -1L, 
          flag_BlaBLa := any(sapply(.SD, grepl, pattern = 'BlaBLa', fixed = TRUE))] 
}

Note that since you're not actually using any regular expression, fixed = TRUE is the more efficient option for using grepl. Drop fixed = TRUE if the pattern you're trying to detect is indeed a regular expression.

If not all of your columns are string columns, this could be made more efficient by making .SDcols more strict, e.g.

.SDcols = intersect(2:ncols(l[[ii]]), which(sapply(l[[ii]], is.character)))

(or possibly using is.character(x) || is.factor(x))

Upvotes: 1

samkart
samkart

Reputation: 6644

Here is an approach with .SDcols

require(dplyr)
require(data.table)
require(stringr)

DT <- DT[, key_ := do.call(paste, c(.SD, sep = "_")), .SDcols = 2:ncol(DT)]
DT <- DT[, has_blabla := as.integer(str_detect(key_, "BlaBla"))]

The first creates a key with all column values per row separated by '_'. Then the next searches it and flags as binary. The search returns TRUE or FALSE, which when coerced to integer is binary.

Upvotes: 1

Related Questions