rkraft
rkraft

Reputation: 555

Filter conditions based on a list of column in data.table

Problem: Suppose I have the following data.table object. I want to apply the following filter condition:

(CMT_1 != "") | (CMT_2 != "") | (CMT_3 != "")

As there can be more CMT_* and I do not know them a-prioi I want to make this condition flexible (depending on how many CMT_* columns I have). Any suggestions how to write this nicely?

library(data.table)

dt <- data.table(
  CMT_1 = 1:3,
  CMT_2 = 4:6,
  CMT_4 = 8:10,
  remainder1 = 12:14,
  remainder2 = 15:17
)

cmts <- names(dt)[startsWith(names(dt), "CMT_")]

## filter condition which I want to make flexible
dt[(CMT_1 != "") | (CMT_2 != "") | (CMT_3 != ""))

Upvotes: 0

Views: 305

Answers (2)

MichaelChirico
MichaelChirico

Reputation: 34763

You can build the condition programmatically, then evaluate it:

cols = grep('^CMT_', names(dt), value = TRUE)
cond = Reduce(
  function(x, y) call('|', x, y), 
  lapply(cols, function(col) call('nzchar', as.name(col)))
)
dt[eval(cond)]

building an expression with call and name can take some getting used to, it might be easier to use str2lang more directly here:

cond = str2lang(
  paste(sprintf('nzchar(%s)', cols), collapse = ' | ')
)
cond
# nzchar(CMT_1) | nzchar(CMT_2) | nzchar(CMT_4)

Or you can evaluate the columns one by one and Reduce using | alone:

Reduce(
  '|', 
  lapply(cols, function(col) nzchar(dt[[col]]))
)

The disadvantage of the rowSums approach is matrix conversion -- your whole table has to be converted to matrix first. nzchar is also a faster version of x != ''.

Upvotes: 1

chinsoon12
chinsoon12

Reputation: 25223

Here is one option using rowSums and .I to extract those rows before subsetting:

cmts <- grep("^CMT_", names(dt), value=TRUE)
dt[dt[, .I[rowSums(.SD!="") > 1L], .SDcols=cmts]]

Upvotes: 0

Related Questions