rsoren

Reputation: 4206

Assign multiple variables that reference each other, using data.table

I'd like to assign a value to a variable, then use that variable to create a new variable. The data.table syntax supports multiple assignment, but apparently not when one of the new columns refers to another. The "i" and "by" clauses in my real use case are more complicated, so I'd prefer not to repeat them like this:

require(data.table)

dt <- data.table(
  x = 1:5, 
  y = 2:6
)

# this works
dt[x == 3, z1 := x + y]
dt[x == 3, z2 := z1 + 5]

# but I wish this worked
dt[x == 3, `:=`(
  z1 = x + y,
  z2 = z1 + 5
)]

In contrast, this works in dplyr:

require(dplyr)

df <- data.frame(
  x = 1:5, 
  y = 2:6
)

df <- mutate(df,
  z1 = x + y,
  z2 = z1 + 5
)

Is there a clean way to do this using data.table?

EDIT: Tweaking akrun's solution slightly, I figured out a way to keep the readable, sequential syntax I was looking for. It's just doing all of the operations outside the list:

dt[x==3, c('z1','z2','z3') := {
  z1 <- x+y
  z2 <- z1 + 5
  z3 <- z2 + 6
  list(z1, z2, z3) 
}]

Upvotes: 4

Views: 488

Answers (1)

akrun

Reputation: 886938

We can use curly brackets to compute a temporary variable, return it in a list together with the calculation that depends on it, and assign (:=) that list to the columns we need to create.

dt[x == 3, c('z1', 'z2') := {
  z1 <- x + y
  list(z1, z1 + 5)
}]
dt
#   x y z1 z2
#1: 1 2 NA NA
#2: 2 3 NA NA
#3: 3 4  7 12
#4: 4 5 NA NA
#5: 5 6 NA NA
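
Since the question mentions more complicated "i" and "by" clauses, here is a small sketch of the same braces pattern combined with grouping. The column g and the group-centering step are made up for illustration; they are not part of the original data.

```r
library(data.table)

# Hypothetical data: g is an illustrative grouping column.
dt_by <- data.table(g = c("a", "a", "b", "b", "c"), x = 1:5, y = 2:6)

# The braces are evaluated once per group, so z1 can be reused
# for z2 without repeating the i and by clauses.
dt_by[x >= 2, c("z1", "z2") := {
  z1 <- x + y
  z2 <- z1 - mean(z1)  # centre z1 within each group
  list(z1, z2)
}, by = g]

dt_by
```

Rows that do not match the "i" condition (here the first row) are left as NA in both new columns.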

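To make it a bit faster, we can set a key on x and subset with a keyed join. The join needs .(3L) in i; a bare (3) would select the third row by position rather than the rows where x is 3.

setkey(dt, x)[.(3L), c('z1', 'z2') := {
  z1 <- x + y
  list(z1, z1 + 5)
}]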
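
As a quick check of how a number in i is interpreted on a keyed table, the sketch below uses a small made-up table (dt_key is hypothetical) to compare a join on the key with plain row-number indexing.

```r
library(data.table)

# Hypothetical table whose rows are not sorted by x.
dt_key <- data.table(x = c(5L, 3L, 3L, 1L), y = 1:4)
setkey(dt_key, x)              # reorders rows so x is 1, 3, 3, 5

by_join     <- dt_key[.(3L), y]  # join on the key: all rows where x == 3
by_position <- dt_key[(3), y]    # a plain number: the third row only

by_join
by_position
```

The two results differ whenever the key value and the row position do not happen to coincide, which is easy to miss on a toy table like 1:5.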

Benchmarks

set.seed(24)
dt1 <- data.table(x = sample(1:9, 1e8, replace=TRUE), y = sample(5:9, 1e8, replace=TRUE))

dt2 <- copy(dt1)
dt3 <- copy(dt1)

akrun1 <- function() {
  dt1[x == 3, c('z1', 'z2') := {
    z1 <- x + y
    list(z1, z1 + 5)
  }]
}

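akrun2 <- function() {
  # .(3L) joins on the key; a bare (3) would pick row 3 by position.
  # The timings reported below were recorded with the (3) form.
  setkey(dt3, x)[.(3L), c('z1', 'z2') := {
    z1 <- x + y
    list(z1, z1 + 5)
  }]
}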


rsoren <- function() {
  dt2[x == 3, z1 := x + y]
  dt2[x == 3, z2 := z1 + 5]
}



library(microbenchmark)
microbenchmark(akrun1(), akrun2(), rsoren(), unit= "relative", times = 20L)
#Unit: relative
#     expr      min       lq     mean   median       uq       max neval
# akrun1() 1.597267 1.605404 1.393016 1.642584 1.538929 0.8634406    20
# akrun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    20
# rsoren() 2.584153 2.586185 2.179601 2.694469 2.468219 0.9740701    20

Upvotes: 4
