drstevok
drstevok

Reputation: 715

What is the most efficient way to replace values in a data.table using a dictionary?

I have a pretty large data.table (1e7 by 50 columns). Apart from the key, the columns are logical. Here a mini-version.

library(data.table)
d.in <- data.table(
  id=c(rep(1:2,each=4)), 
  time=c(1:4,1:4),
  drugA=c(T,T,F,T, F,F,F,F),
  drugB=c(F,F,F,F, T,F,F,F)
  )
setkey(d.in, id,time)

which looks like this

id  time    drugA   drugB
1   1   TRUE    FALSE
1   2   TRUE    FALSE
1   3   FALSE   FALSE
1   4   TRUE    FALSE
2   1   FALSE   TRUE
2   2   FALSE   FALSE
2   3   FALSE   FALSE
2   4   FALSE   FALSE

I have a key-value 'dictionary'

kv <- c(drugA=1, drugB=2)

And I want to use this dictionary to replace values in the logical columns with the values in the 'dictionary'. The output should look like this.

id  time    drugA   drugB
1   1   1   NA
1   2   1   NA
1   3   NA  NA
1   4   1   NA
2   1   NA  2
2   2   NA  NA
2   3   NA  NA
2   4   NA  NA

What is the most efficient (quickest) way of doing this?

Update

I have tried the solutions below, and can't find a big difference (although I am not sure that my approach to comparing is valid).

library(microbenchmark)
d.orig <- data.table(
  id=c(rep(1:2,each=1e7)), 
  time=c(1e7,1e7),
  drugA=sample(c(T,F), 2e7, replace=T),
  drugB=sample(c(T,F), 2e7, replace=T)
  )

# Solution 1
foo1 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in[, names(kv) := lapply(names(kv), function(x) {
    gx <- get(x)
    replace(NA_real_[seq_along(gx)], gx, kv[x])
    })]
}

# Solution 2
dt_kv <- data.table(drug = c("drugA","drugB"), value = c(1,2))
foo2 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in <- melt(d.in, id.vars = c("id", "time"))[ 
    dt_kv, on = c(variable = "drug"), nomatch = 0][
    value == FALSE, i.value := NA]

  dcast(d.in, formula = id + time ~ variable, value.var = "i.value")
}

# Solution 3
kDT = data.table(variable = names(kv), value = TRUE, v = unname(kv))
foo3 <- function() {
  d.in <- data.table::copy(d.orig)
  DT = melt(d.in, id=c("id","time"))
  DT[kDT, on=.(variable, value), v := i.v ]
  dcast(DT, formula = id + time ~ variable, value.var = 'v')
}

Which produces (albeit with a lot of variation still)

print(microbenchmark(foo1, times=1e4))
Unit: nanoseconds
 expr min lq    mean median uq   max neval
 foo1  33 50 85.8657     55 58 56717 10000

print(microbenchmark(foo2, times=1e4))
Unit: nanoseconds
 expr min lq    mean median uq   max neval
 foo2  29 48 70.8304     52 55 57644 10000

print(microbenchmark(foo3, times=1e4))
Unit: nanoseconds
 expr min lq    mean median uq   max neval
 foo3  30 36 61.1542     41 48 58015 10000

Upvotes: 1

Views: 1525

Answers (3)

Frank
Frank

Reputation: 66819

A common way would start with moving your data to long format:

DT = melt(d.in, id=c("id","time"))

Then put the mapping in a table, similar to @SymbolixAU's answer:

kDT = data.table(variable = names(kv), value = TRUE, v = unname(kv))

Then use the mapping to do an update join, adding a new column by reference:

DT[kDT, on=.(variable, value), v := i.v ]

Generally, I think if you care a lot about speed or simple syntax, you'll want long-format data rather than wide in R, so I would skip the final dcast step (see @SymbolixAU's answer).

Upvotes: 4

SymbolixAU
SymbolixAU

Reputation: 26248

You can create a data.table lookup dictionary, melt the original d.in, join, update, and reshape as required

dt_kv <- data.table(drug = c("drugA","drugB"),
                    value = c(1,2))

d.in <- melt(d.in, id.vars = c("id", "time"))[ 
  dt_kv, on = c(variable = "drug"), nomatch = 0][
    value == FALSE, i.value := NA]

dcast(d.in, formula = id + time ~ variable, value.var = "i.value")

#    id time drugA drugB
# 1:  1    1     1    NA
# 2:  1    2     1    NA
# 3:  1    3    NA    NA
# 4:  1    4     1    NA
# 5:  2    1    NA     2
# 6:  2    2    NA    NA
# 7:  2    3    NA    NA
# 8:  2    4    NA    NA

Upvotes: 2

Rich Scriven
Rich Scriven

Reputation: 99331

Not sure if it's the most efficient way, but you could do

d.in[, names(kv) := lapply(names(kv), function(x) {
        gx <- get(x)
        replace(NA_real_[seq_along(gx)], gx, kv[x])
    })]

Here we iterate over the names in kv, using get to retrieve the column values. Then we replace the relevant values of a newly created NA vector with our kv values, resulting in

   id time drugA drugB
1:  1    1     1    NA
2:  1    2     1    NA
3:  1    3    NA    NA
4:  1    4     1    NA
5:  2    1    NA     2
6:  2    2    NA    NA
7:  2    3    NA    NA
8:  2    4    NA    NA

Upvotes: 3

Related Questions