Data.table - subsetting within groups during group by is slow

Question

I'm trying to produce several aggregate statistics, and some of them need to be produced on a subset of each group. The data.table is quite large, 10 million rows, but using by without column subsetting is blazing fast (less than a second). Adding just one additional column which needs to be calculated on a subset of each group increases the running time by factor of 12.
Is the a faster way to do this? Below is my full code.

library(data.table)
library(microbenchmark)

N = 10^7

DT = data.table(id1 = sample(1:400, size = N, replace = TRUE),
                id2 = sample(1:100, size = N, replace = TRUE),
                id3 = sample(1:50, size = N, replace = TRUE),
                filter_var = sample(1:10, size = N, replace = TRUE),
                x1 = sample(1:1000, size = N, replace = TRUE),
                x2 = sample(1:1000, size = N, replace = TRUE),
                x3 = sample(1:1000, size = N, replace = TRUE),
                x4 = sample(1:1000, size = N, replace = TRUE),
                x5 = sample(1:1000, size = N, replace = TRUE) )

setkey(DT, id1,id2,id3)

microbenchmark( 
  DT[, .(
    sum_x1 = sum(x1),
    sum_x2 = sum(x2),
    sum_x3 = sum(x3),
    sum_x4 = sum(x4),
    sum_x5 = sum(x5),
    avg_x1 = mean(x1),
    avg_x2 = mean(x2),
    avg_x3 = mean(x3),
    avg_x4 = mean(x4),
    avg_x5 = mean(x5)
  ) , by = c('id1','id2','id3')]  , unit = 's', times = 10L)
      min        lq     mean    median       uq      max neval
 0.942013 0.9566891 1.004134 0.9884895 1.031334 1.165144    10


microbenchmark(    DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')]  , unit = 's', times = 10L)

      min      lq     mean   median       uq      max neval
 12.24046 12.4123 12.83447 12.72026 13.49059 13.61248    10

Frank · Accepted Answer

GForce makes grouped operations run faster and will work on expressions like list(x = funx(X), y = funy(Y)), ...) where X and Y are column names and funx and funy belong to the set of optimized functions.

For a full description of what works, see ?GForce.
To test if an expression works, read the messages from DT[, expr, by=, verbose=TRUE].

In the OP's case, we have sum_x1_F1 = sum(x1[filter_var < 5]) which is not covered by GForce even though sum(v) is. In this special case, we can make a var v = x1*condition and sum that:

DT[, v := x1*(filter_var < 5)]

system.time(    DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(v)
) , by = c('id1','id2','id3')])
#    user  system elapsed 
#    0.63    0.19    0.81

For comparison, timing the OP's code on my computer:

system.time(    DT[, .(
  sum_x1 = sum(x1),
  sum_x2 = sum(x2),
  sum_x3 = sum(x3),
  sum_x4 = sum(x4),
  sum_x5 = sum(x5),
  avg_x1 = mean(x1),
  avg_x2 = mean(x2),
  avg_x3 = mean(x3),
  avg_x4 = mean(x4),
  avg_x5 = mean(x5),
  sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')])
#    user  system elapsed 
#    9.00    0.02    9.06

Data.table - subsetting within groups during group by is slow

Answers (1)

Related Questions