Reputation: 715
I have a pretty large data.table (1e7 rows by 50 columns). Apart from the key columns, all columns are logical. Here is a mini-version:
library(data.table)
d.in <- data.table(
  id    = rep(1:2, each = 4),
  time  = c(1:4, 1:4),
  drugA = c(T, T, F, T,  F, F, F, F),
  drugB = c(F, F, F, F,  T, F, F, F)
)
setkey(d.in, id,time)
which looks like this
   id time drugA drugB
    1    1  TRUE FALSE
    1    2  TRUE FALSE
    1    3 FALSE FALSE
    1    4  TRUE FALSE
    2    1 FALSE  TRUE
    2    2 FALSE FALSE
    2    3 FALSE FALSE
    2    4 FALSE FALSE
I have a key-value 'dictionary'
kv <- c(drugA=1, drugB=2)
I want to use this dictionary to replace the values in the logical columns: TRUE becomes the corresponding dictionary value, FALSE becomes NA. The output should look like this:
   id time drugA drugB
    1    1     1    NA
    1    2     1    NA
    1    3    NA    NA
    1    4     1    NA
    2    1    NA     2
    2    2    NA    NA
    2    3    NA    NA
    2    4    NA    NA
What is the most efficient (quickest) way of doing this?
I have tried the solutions below, and can't find a big difference (although I am not sure that my approach to comparing is valid).
library(microbenchmark)
d.orig <- data.table(
  id    = rep(1:2, each = 1e7),
  time  = rep(1:1e7, 2),
  drugA = sample(c(T, F), 2e7, replace = TRUE),
  drugB = sample(c(T, F), 2e7, replace = TRUE)
)
# Solution 1
foo1 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in[, names(kv) := lapply(names(kv), function(x) {
    gx <- get(x)
    replace(NA_real_[seq_along(gx)], gx, kv[x])
  })]
}
# Solution 2
dt_kv <- data.table(drug = c("drugA", "drugB"), value = c(1, 2))
foo2 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in <- melt(d.in, id.vars = c("id", "time"))[
    dt_kv, on = c(variable = "drug"), nomatch = 0][
      value == FALSE, i.value := NA]
  dcast(d.in, formula = id + time ~ variable, value.var = "i.value")
}
# Solution 3
kDT <- data.table(variable = names(kv), value = TRUE, v = unname(kv))
foo3 <- function() {
  d.in <- data.table::copy(d.orig)
  DT <- melt(d.in, id = c("id", "time"))
  DT[kDT, on = .(variable, value), v := i.v]
  dcast(DT, formula = id + time ~ variable, value.var = "v")
}
Which produces the following (though still with a lot of run-to-run variation):
print(microbenchmark(foo1, times=1e4))
Unit: nanoseconds
expr min lq mean median uq max neval
foo1 33 50 85.8657 55 58 56717 10000
print(microbenchmark(foo2, times=1e4))
Unit: nanoseconds
expr min lq mean median uq max neval
foo2 29 48 70.8304 52 55 57644 10000
print(microbenchmark(foo3, times=1e4))
Unit: nanoseconds
expr min lq mean median uq max neval
foo3 30 36 61.1542 41 48 58015 10000
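Note: passing the bare function name to microbenchmark() times only the evaluation of the symbol foo1, not a call to the function, which is why all three medians are a few nanoseconds regardless of data size. To compare the solutions the functions need to be called, e.g. microbenchmark(foo1(), foo2(), foo3(), times = 10). A minimal sketch of the difference, using a small stand-in function rather than the 2e7-row versions above:

```r
library(microbenchmark)

# Stand-in for foo1/foo2/foo3: a function that does measurable work
f <- function() Sys.sleep(0.01)

# Passing the bare name 'f' times only the symbol lookup (nanoseconds)
print(microbenchmark(f, times = 5))

# Passing the call 'f()' times the actual work (~10 ms per iteration)
print(microbenchmark(f(), times = 5))
```

The same correction applies to the benchmarks above: with foo1() etc. the timings scale with the size of d.orig, as expected.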
Upvotes: 1
Views: 1525
Reputation: 66819
A common way would start with moving your data to long format:
DT = melt(d.in, id=c("id","time"))
Then put the mapping in a table, similar to @SymbolixAU's answer:
kDT = data.table(variable = names(kv), value = TRUE, v = unname(kv))
Then use the mapping to do an update join, adding a new column by reference:
DT[kDT, on=.(variable, value), v := i.v ]
Generally, I think if you care a lot about speed or simple syntax, you'll want long-format data rather than wide in R, so I would skip the final dcast
step (see @SymbolixAU's answer).
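Put together on the mini example from the question, the melt plus update-join steps look like this (a sketch; the final dcast back to wide format is left out, as suggested above):

```r
library(data.table)

# Mini example from the question
d.in <- data.table(
  id    = rep(1:2, each = 4),
  time  = c(1:4, 1:4),
  drugA = c(TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE),
  drugB = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
kv <- c(drugA = 1, drugB = 2)

# Long format: one row per (id, time, drug)
DT <- melt(d.in, id = c("id", "time"))

# Mapping table: (drug name, TRUE) -> numeric code
kDT <- data.table(variable = names(kv), value = TRUE, v = unname(kv))

# Update join: matching (variable, value) rows get v filled in by
# reference; non-matching rows (value == FALSE) are left as NA
DT[kDT, on = .(variable, value), v := i.v]
```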
Upvotes: 4
Reputation: 26248
You can create a data.table lookup dictionary, melt the original d.in, join, update, and reshape as required:
dt_kv <- data.table(drug = c("drugA", "drugB"),
                    value = c(1, 2))

d.in <- melt(d.in, id.vars = c("id", "time"))[
  dt_kv, on = c(variable = "drug"), nomatch = 0][
    value == FALSE, i.value := NA]

dcast(d.in, formula = id + time ~ variable, value.var = "i.value")
#    id time drugA drugB
# 1:  1    1     1    NA
# 2:  1    2     1    NA
# 3:  1    3    NA    NA
# 4:  1    4     1    NA
# 5:  2    1    NA     2
# 6:  2    2    NA    NA
# 7:  2    3    NA    NA
# 8:  2    4    NA    NA
Upvotes: 2
Reputation: 99331
Not sure if it's the most efficient way, but you could do
d.in[, names(kv) := lapply(names(kv), function(x) {
  gx <- get(x)
  replace(NA_real_[seq_along(gx)], gx, kv[x])
})]
Here we iterate over the names in kv, using get to retrieve the column values. Then we replace the relevant values of a newly created NA vector with our kv values, resulting in
   id time drugA drugB
1:  1    1     1    NA
2:  1    2     1    NA
3:  1    3    NA    NA
4:  1    4     1    NA
5:  2    1    NA     2
6:  2    2    NA    NA
7:  2    3    NA    NA
8:  2    4    NA    NA
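For comparison, the same column-by-column replacement can be written as a loop with data.table::set, which updates each column in place and avoids the overhead of [.data.table inside the loop (a sketch on the mini example from the question, not the original answer's code):

```r
library(data.table)

d.in <- data.table(
  id    = rep(1:2, each = 4),
  time  = c(1:4, 1:4),
  drugA = c(TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE),
  drugB = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
kv <- c(drugA = 1, drugB = 2)

# For each dictionary entry, replace the whole logical column:
# TRUE -> kv value, FALSE -> NA. With no 'i' argument, set()
# plonks the new vector in as the column, changing its type.
for (nm in names(kv)) {
  col <- d.in[[nm]]
  set(d.in, j = nm, value = ifelse(col, kv[[nm]], NA_real_))
}
```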
Upvotes: 3