Reputation: 2489
Based on a data.table, DT, with m string columns, how can flag rows in which a string, let's say "BlaBla" appears in at least one column?
What I want is:
DT[, flag_BlaBLa:=(test if "BlaBLa" appears in any column except the first one)]
I do not want to make explicit reference to column names. Rather I need to refer to col 2 to the max col number in DT (perhaps: 2:ncol(DT)
). Why not? In reality, I have a list of data.frames, with varying number of columns in each. I need to create the flag, as indicated above, for all the data.frames in the list
Edit1: adding a reproducible example. Edit2: changing some 'BlaBLa' strings to 'BlaBLa+something'. This should also be matched considered a match as they include 'BlaBLa'.
DT starts with cols a,b,c
and I want a script to create the flag_BlaBLa
. Final result would be:
DT <- data.table(a=c("x","y","z",'w'),
b=c('BlaBLa','BLe','Bli','BlaBLaSometing'),
c=c('Bla','BLe','BlaBLa','Blo'),
flag_BlaBLa=c(T,F,T,T)
)
For the list, the final result should be:
DT1 <- data.table(a=c("x","y","z",'w'),
b=c('BlaBLa','BLe','Bli','BlaBLaSomething'),
c=c('Bla','BLe','BlaBLa','Blo'),
flag_BlaBLa=c(T,F,T,T)
)
DT2 <- data.table(a=c("q","j","p"),
b=c('BLe','Bli','BlaBLa'),
flag_BlaBLa=c(F,F,T)
)
l <- list(DT1,DT2)
Upvotes: 1
Views: 721
Reputation: 388982
We could loop over the list , select the columns and check if there is atleast one "BlaBLa" in any row and mark the row TRUE
/FALSE
.
library(data.table)
lapply(l, function(x) x[, flag_BlaBLa := rowSums(x[,2:ncol(x)] == "BlaBLa") > 0])
l
#[[1]]
# a b c flag_BlaBLa
#1: x BlaBLa Bla TRUE
#2: y BLe BLe FALSE
#3: z Bli BlaBLa TRUE
#4: w BlaBLa Blo TRUE
#[[2]]
# a b flag_BlaBLa
#1: q BLe FALSE
#2: j Bli FALSE
#3: p BlaBLa TRUE
EDIT
If it's not an exact match and we need to find a pattern of that string we can need to loop over the columns as well (similar to @MichaelChirico) before taking rowSums
lapply(l, function(x) x[, flag_BlaBLa := rowSums(sapply(x[, 2:ncol(x)],
grepl, pattern = 'BlaBLa', fixed = TRUE)) > 0])
#[[1]]
# a b c flag_BlaBLa
#1: x BlaBLa Bla TRUE
#2: y BLe BLe FALSE
#3: z Bli BlaBLa TRUE
#4: w BlaBLaSomething Blo TRUE
#[[2]]
# a b flag_BlaBLa
#1: q BLe FALSE
#2: j Bli FALSE
#3: p BlaBLa TRUE
Upvotes: 1
Reputation: 887118
We can specify the columns of interest in .SDcols
, loop through the Subset of data.table (.SD
) check if it is equal 'BlaBLa', Reduce
it to single logical vector
to create the column
library(data.table)
lapply(l, function(x) x[, flag_BlaBLa := Reduce(`|`, lapply(.SD, `==`,
"BlaBLa")), .SDcols = 2:ncol(x)][])
#[[1]]
# a b c flag_BlaBLa
#1: x BlaBLa Bla TRUE
#2: y BLe BLe FALSE
#3: z Bli BlaBLa TRUE
#4: w BlaBLa Blo TRUE
#[[2]]
# a b flag_BlaBLa
#1: q BLe FALSE
#2: j Bli FALSE
#3: p BlaBLa TRUE
Upvotes: 3
Reputation: 34703
I would use a for
loop over the list and sapply
over the columns, using .SDcols
to exclude the first:
for (ii in seq_along(l)) {
l[[ii]][ , .SDcols = -1L,
flag_BlaBLa := any(sapply(.SD, grepl, pattern = 'BlaBLa', fixed = TRUE))]
}
Note that since you're not actually using any regular expression, fixed = TRUE
is the more efficient option for using grepl
. Drop fixed = TRUE
if the pattern you're trying to detect is indeed a regular expression.
If not all of your columns are string columns, this could be made more efficient by making .SDcols
more strict, e.g.
.SDcols = intersect(2:ncols(l[[ii]]), which(sapply(l[[ii]], is.character)))
(or possibly using is.character(x) || is.factor(x)
)
Upvotes: 1
Reputation: 6644
Here is an approach with .SDcols
require(dplyr)
require(data.table)
require(stringr)
DT <- DT[, key_ := do.call(paste, c(.SD, sep = "_")), .SDcols = 2:ncol(DT)]
DT <- DT[, has_blabla := as.integer(str_detect(key_, "BlaBla"))]
The first creates a key with all column values per row separated by '_'. Then the next searches it and flags as binary. The search returns TRUE
or FALSE
, which when coerced to integer is binary.
Upvotes: 1