Tony2016
Tony2016

Reputation: 267

Speeding up data.table SD[,]

I have code that use the below snippet a fair bit as i want the most recent record after grouping by groups group1, group2, group3 and group4

dt[order(-date), .SD[1,], by=.(group1, group2, group3, group4)]

The issue is the code is extremely slow. Is there a faster alternative?

Upvotes: 2

Views: 143

Answers (1)

akrun
akrun

Reputation: 887088

Using head would improve the speed.

dt[order(-date), head(.SD, 1), by=.(group1, group2, group3, group4)]

Or another option is .I that can speed up the process significantly

dt[dt[order(-date), .I[1L], by=.(group1, group2, group3, group4)]$V1]

Benchmarks

set.seed(234)
dt <- data.table(group1 = sample(letters, 1e6, replace=TRUE),
              group2 = sample(letters, 1e6, replace=TRUE),
              group3 = sample(1:10, 1e6, replace=TRUE), 
              group4 = sample(5:15, 1e6, replace=TRUE),
              date = sample(1:30, 1e6, replace=TRUE), 
              value = rnorm(1e6))
system.time(dt[order(-date), .SD[1,], by=.(group1, group2, group3, group4)])
#   user  system elapsed 
#  29.63   15.65   46.17 

 system.time(dt[order(-date), head(.SD, 1), by=.(group1, group2, group3, group4)])
 #  user  system elapsed 
 #  0.22    0.00    0.14 
 system.time(dt[dt[order(-date), .I[1L], by=.(group1, group2, group3, group4)]$V1])
 #  user  system elapsed 
 #  0.23    0.02    0.15 

Also, from @Frank's suggestion

system.time(dt[order(-date), .SD[1L], by=.(group1, group2, group3, group4)])
#   user  system elapsed 
#  0.16    0.05    0.16 

Upvotes: 4

Related Questions