Reputation: 1334
I'm trying to produce several aggregate statistics, and some of them need to be produced on a subset of each group. The data.table is quite large, 10 million rows, but using by
without column subsetting is blazing fast (less than a second). Adding just one additional column which needs to be calculated on a subset of each group increases the running time by factor of 12.
Is the a faster way to do this? Below is my full code.
library(data.table)
library(microbenchmark)
N = 10^7
DT = data.table(id1 = sample(1:400, size = N, replace = TRUE),
id2 = sample(1:100, size = N, replace = TRUE),
id3 = sample(1:50, size = N, replace = TRUE),
filter_var = sample(1:10, size = N, replace = TRUE),
x1 = sample(1:1000, size = N, replace = TRUE),
x2 = sample(1:1000, size = N, replace = TRUE),
x3 = sample(1:1000, size = N, replace = TRUE),
x4 = sample(1:1000, size = N, replace = TRUE),
x5 = sample(1:1000, size = N, replace = TRUE) )
setkey(DT, id1,id2,id3)
microbenchmark(
DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5)
) , by = c('id1','id2','id3')] , unit = 's', times = 10L)
min lq mean median uq max neval
0.942013 0.9566891 1.004134 0.9884895 1.031334 1.165144 10
microbenchmark( DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5),
sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')] , unit = 's', times = 10L)
min lq mean median uq max neval
12.24046 12.4123 12.83447 12.72026 13.49059 13.61248 10
Upvotes: 1
Views: 391
Reputation: 66819
GForce makes grouped operations run faster and will work on expressions like list(x = funx(X), y = funy(Y)), ...)
where X
and Y
are column names and funx
and funy
belong to the set of optimized functions.
?GForce
. DT[, expr, by=, verbose=TRUE]
.In the OP's case, we have sum_x1_F1 = sum(x1[filter_var < 5])
which is not covered by GForce even though sum(v)
is. In this special case, we can make a var v = x1*condition and sum that:
DT[, v := x1*(filter_var < 5)]
system.time( DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5),
sum_x1_F1 = sum(v)
) , by = c('id1','id2','id3')])
# user system elapsed
# 0.63 0.19 0.81
For comparison, timing the OP's code on my computer:
system.time( DT[, .(
sum_x1 = sum(x1),
sum_x2 = sum(x2),
sum_x3 = sum(x3),
sum_x4 = sum(x4),
sum_x5 = sum(x5),
avg_x1 = mean(x1),
avg_x2 = mean(x2),
avg_x3 = mean(x3),
avg_x4 = mean(x4),
avg_x5 = mean(x5),
sum_x1_F1 = sum(x1[filter_var < 5]) #this line slows everything down
) , by = c('id1','id2','id3')])
# user system elapsed
# 9.00 0.02 9.06
Upvotes: 5