Reputation: 73
My code sample is below; it probably explains things better than I can. I get why this doesn't work: R performs the boolean operation on the column names, not on the values in the columns. But I'm not sure how to make it work.
library(data.table)
DT = data.table(
  a = 1:5,
  b = 6:10,
  a_valid = c(0, 1, 1, 0, 0),
  b_valid = c(1, 1, 0, 0, 0)
)
# This works
DT[a_valid == 0, a := NA]
numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')
# This doesn't.
DT[binary_columns == 0, numeric_columns := NA]
Upvotes: 0
Views: 277
Reputation: 6226
To @sindri_baldur's solution I would add the possibility of using lapply:
lapply(seq_along(numeric_columns), function(i) DT[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA])
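This avoids writing an explicit for loop, though lapply still iterates over the columns one at a time, so it is worth measuring whether it is actually faster.
A small benchmark can help choose between the solutions: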
library(data.table)
DT = data.table(
  a = 1:1e5,
  b = 1:1e5 + 1e5,
  a_valid = sample(c(0, 1), size = 1e5, replace = TRUE),
  b_valid = sample(c(0, 1), size = 1e5, replace = TRUE)
)
numeric_columns <- c('a', 'b')
binary_columns <- c('a_valid', 'b_valid')
dt2 <- copy(DT)
dt3 <- copy(DT)
dt4 <- copy(DT)
microbenchmark::microbenchmark(
  # for loop with :=
  for (i in seq_along(numeric_columns)) {
    dt2[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
  },
  # lapply with :=
  lapply(seq_along(numeric_columns), function(i) dt3[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]),
  # for loop with set()
  for (i in seq_along(numeric_columns)) {
    set(
      dt4,
      i = which(dt4[[binary_columns[i]]] == 0),
      j = numeric_columns[i],
      value = NA_integer_
    )
  }
)
# min lq mean median uq max neval
# 9.962940 10.104035 11.278033 10.226006 10.555132 22.10373 100
# 4.453995 4.535093 7.726525 4.659652 4.830672 234.04730 100
# 11.781060 11.913439 13.056660 12.021012 12.365140 26.84604 100
The winner is the lapply solution in that scenario. If you need that kind of thing on more than two columns, the set solution will probably be better.
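Before comparing speed, it may be worth confirming that the three variants produce the same table. The following is a minimal sketch on the question's small data set; the names via_for, via_lapply and via_set are only illustrative, and all.equal is used for the comparison rather than anything from the benchmark above.

```r
library(data.table)

DT <- data.table(
  a = 1:5,
  b = 6:10,
  a_valid = c(0, 1, 1, 0, 0),
  b_valid = c(1, 1, 0, 0, 0)
)
numeric_columns <- c("a", "b")
binary_columns <- c("a_valid", "b_valid")

# Work on separate copies, because := and set() modify by reference
via_for <- copy(DT)
via_lapply <- copy(DT)
via_set <- copy(DT)

# Variant 1: for loop with :=
for (i in seq_along(numeric_columns)) {
  via_for[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
}

# Variant 2: lapply with :=; invisible() suppresses the returned list
invisible(lapply(seq_along(numeric_columns), function(i) {
  via_lapply[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
}))

# Variant 3: for loop with set()
for (i in seq_along(numeric_columns)) {
  set(
    via_set,
    i = which(via_set[[binary_columns[i]]] == 0),
    j = numeric_columns[i],
    value = NA_integer_
  )
}

# Both comparisons should be TRUE if the variants agree
same_lapply <- isTRUE(all.equal(via_for, via_lapply))
same_set <- isTRUE(all.equal(via_for, via_set))
```

Wrapping the lapply call in invisible() is optional; it only keeps the list of returned tables from being printed at the console.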
Upvotes: 1
Reputation: 33498
You could use a loop:
for (i in seq_along(numeric_columns)) {
  DT[get(binary_columns[i]) == 0, (numeric_columns[i]) := NA]
}
The loop should be slightly faster with set():
for (i in seq_along(numeric_columns)) {
  set(
    DT,
    i = which(DT[[binary_columns[i]]] == 0),
    j = numeric_columns[i],
    value = NA_integer_
  )
}
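If this comes up repeatedly, the set() loop could be wrapped in a small function. This is only a sketch: blank_invalid is a hypothetical name, not part of data.table, and it assumes the value columns are integer (hence NA_integer_) and that the two name vectors are paired by position.

```r
library(data.table)

# Hypothetical helper: set value_cols[k] to NA wherever flag_cols[k] == 0.
# Modifies `dt` by reference and returns it invisibly.
blank_invalid <- function(dt, value_cols, flag_cols) {
  # The two vectors are matched by position, so they must be the same length
  stopifnot(length(value_cols) == length(flag_cols))
  for (k in seq_along(value_cols)) {
    set(
      dt,
      i = which(dt[[flag_cols[k]]] == 0),
      j = value_cols[k],
      value = NA_integer_  # assumes integer value columns, as in the question
    )
  }
  invisible(dt)
}

DT <- data.table(
  a = 1:5,
  b = 6:10,
  a_valid = c(0, 1, 1, 0, 0),
  b_valid = c(1, 1, 0, 0, 0)
)
blank_invalid(DT, c("a", "b"), c("a_valid", "b_valid"))
DT$a  # NA  2  3 NA NA
DT$b  #  6  7 NA NA NA
```

Because set() works by reference, the call changes DT in place; pass copy(DT) instead if the original should be kept.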
Or switching to base R for a second:
setDF(DT)
DT[numeric_columns][DT[binary_columns] == 0] <- NA
setDT(DT)
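To see why the base R line works, here is a small sketch on a plain data.frame with the question's data (the names df and mask are only for illustration). Comparing a data.frame with 0 gives a logical matrix, and that matrix can index the value columns cell by cell, provided both name vectors list the columns in matching order.

```r
# Plain data.frame version of the question's data (no data.table needed)
df <- data.frame(
  a = 1:5,
  b = 6:10,
  a_valid = c(0, 1, 1, 0, 0),
  b_valid = c(1, 1, 0, 0, 0)
)
numeric_columns <- c("a", "b")
binary_columns <- c("a_valid", "b_valid")

# Comparing a data.frame with a scalar yields a logical matrix,
# one column per flag column, in the order given by binary_columns
mask <- df[binary_columns] == 0

# The matrix lines up position by position with df[numeric_columns],
# so TRUE cells mark the values to replace with NA
df[numeric_columns][mask] <- NA

df$a  # NA in rows 1, 4 and 5
df$b  # NA in rows 3, 4 and 5
```

In the data.table version, setDF() and setDT() switch the class by reference, so the round trip does not copy the data.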
Upvotes: 2