gzm

Reputation: 73

Data.table conditional on list of columns

My code sample is below; it probably explains things better than I can. I get why this doesn't work: R performs the boolean operation on the column names, not on the values in the columns. But I'm not sure how to make it work.

library(data.table)

DT = data.table( a = 1:5,
                 b = 6:10,
                 a_valid = c(0,1,1,0,0),
                 b_valid = c(1,1,0,0,0)
)

# This works
DT[a_valid == 0, a := NA]

numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')

# This doesn't.
DT[binary_columns == 0, numeric_columns := NA]
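To make the failure concrete, here is a minimal base R sketch of what the `i` expression evaluates to. Because `binary_columns` is not a column of `DT`, it is looked up as the ordinary character vector, and the comparison never touches the table's values. The name `cmp` is only for illustration.

```r
# binary_columns holds column *names*, i.e. plain strings
binary_columns <- c('a_valid', 'b_valid')

# The number 0 is coerced to the string "0" and compared with each name,
# so the result has one element per name, not one per row of DT
cmp <- binary_columns == 0
print(cmp)
```

Neither name equals "0", so both elements are FALSE, and the result has length 2 rather than the 5 rows of `DT`.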

Upvotes: 0

Views: 277

Answers (2)

linog

Reputation: 6226

I would add to @sindri_baldur's solution the possibility of using lapply:

lapply(seq_along(numeric_columns), function(i) DT[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA])

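This may avoid some of the overhead of an explicit for loop. Note that lapply() returns a list of its results, so wrap the call in invisible() if you don't want that list printed at the console.

A small benchmark can help to choose the best solution: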

library(data.table)

DT = data.table(a = 1:1e5,
                b = 1:1e5 + 1e5,
                a_valid = sample(c(0,1), size = 1e5, replace = TRUE),
                b_valid = sample(c(0,1), size = 1e5, replace = TRUE)
)
numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')

dt2 <- copy(DT)
dt3 <- copy(DT)
dt4 <- copy(DT)
microbenchmark::microbenchmark(
  for (i in seq_along(numeric_columns)) {
    dt2[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
  },
  lapply(seq_along(numeric_columns), function(i) dt3[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]),
  # set() version: the loop variable must be the same `i` used for indexing below
  for (i in seq_along(numeric_columns)) {
    set(
      dt4, 
      i = which(dt4[[binary_columns[i]]] == 0), 
      j = numeric_columns[i], 
      value = NA_integer_
    )
  }
)

#       min        lq      mean    median        uq       max neval
#  9.962940 10.104035 11.278033 10.226006 10.555132  22.10373   100
#  4.453995  4.535093  7.726525  4.659652  4.830672 234.04730   100
# 11.781060 11.913439 13.056660 12.021012 12.365140  26.84604   100

In this run, the lapply solution had the lowest median time. If you need to do this on more than two columns, the set() solution will probably be better.

Upvotes: 1

s_baldur

Reputation: 33498

You could use a loop:

for (i in seq_along(numeric_columns)) {
  DT[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
}

Which should be slightly faster with set():

for (i in seq_along(numeric_columns)) {
  set(
    DT, 
    i = which(DT[[binary_columns[i]]] == 0), 
    j = numeric_columns[i], 
    value = NA_integer_
  )
}
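As a quick sanity check, here is a sketch that runs the set() loop on the sample data from the question (assuming the data.table package is installed). Each entry in `binary_columns` is paired by position with the entry in `numeric_columns`, so `a` is blanked where `a_valid == 0` and `b` where `b_valid == 0`.

```r
library(data.table)

# Sample data from the question
DT <- data.table(a = 1:5,
                 b = 6:10,
                 a_valid = c(0, 1, 1, 0, 0),
                 b_valid = c(1, 1, 0, 0, 0))
numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')

for (i in seq_along(numeric_columns)) {
  set(
    DT,
    i = which(DT[[binary_columns[i]]] == 0),  # row numbers where the flag is 0
    j = numeric_columns[i],                   # column to overwrite
    value = NA_integer_
  )
}

print(DT)
```

After the loop, `a` is NA in rows 1, 4 and 5, and `b` is NA in rows 3, 4 and 5; the flag columns are unchanged.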

Or switching to base R for a second:

setDF(DT)
DT[numeric_columns][DT[binary_columns] == 0] <- NA
setDT(DT)
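The base R version relies on indexing a data.frame with a logical matrix. A minimal sketch on a plain data.frame (so no setDF()/setDT() round trip is needed) shows the intermediate step; the names `df` and `mask` are only for illustration.

```r
# Same sample data as in the question, as a plain data.frame
df <- data.frame(a = 1:5,
                 b = 6:10,
                 a_valid = c(0, 1, 1, 0, 0),
                 b_valid = c(1, 1, 0, 0, 0))
numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')

# Comparing a data.frame with a scalar gives a logical matrix,
# one row per row of df and one column per flag column
mask <- df[binary_columns] == 0
print(dim(mask))

# The mask is applied cell by cell to the selected numeric columns,
# so the columns are matched by position: a with a_valid, b with b_valid
df[numeric_columns][mask] <- NA
print(df)
```

Because the pairing is positional, `numeric_columns` and `binary_columns` need to be in the same order and have the same length for this to do what is intended.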

Upvotes: 2
