Beasterfield

Reputation: 7113

Improve aggregation of data.table

Let's say I have a data.table looking like this:

dt <- data.table(
  a   = c( "A", "B", "C", "C" ),
  b   = c( "U", "V", "W", "X" ),
  c   = c( 0.1, 0.2, 0.3, 0.4 ),
  min = c( 0,   1,   2,   3 ),
  max = c( 11,  12,  13,  14 ),
  val = c( 100, 200, 300, 400 ),
  key = "a"
)

My actual data.table has many more columns and up to a couple of million rows. About 10% of the rows have a duplicated key a. I'd like to aggregate those rows with a function like this one:

comb <- function( x ){
  k <- which.max( x[ , c ] )
  list( b = x[ k, b ], c = x[ k, c ],
        min = min( x[ , min ] ), max = max( x[ , max ] ),
        val = sum( x[ , val ] ) )
}

However, calling

dt <- dt[ , comb(.SD), by = a ]

is very slow and I'm wondering how I could improve this. Any help is appreciated.
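For reference, on the sample data above this aggregation should produce something like the following (the two rows sharing key "C" collapse into one):

   a b   c min max val
1: A U 0.1   0  11 100
2: B V 0.2   1  12 200
3: C X 0.4   2  14 700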

Upvotes: 2

Views: 160

Answers (1)

G. Grothendieck

Reputation: 269461

By placing c in the key and using .N to pick out the row with the maximum c, we can avoid which.max (untested):

setkey(dt, a, c)
dt[, c(.SD[.N], min = min[1], val = sum(val)), by = a][, -c(4, 6)]

ADDED: or this variation:

dt[, c(.SD[.N, c(1:2, 4)], min = min[1], val = sum(val)), by = a]

ADDED 2: We only used .SD because you indicated you had many columns, but if you are willing to write them out, the above could be written as:

dt[, list(b = b[.N], c = c[.N], min = min[1], max = max[.N], val = sum(val)), by = a]

ADDED 3: Yet another variation:

dt[, c("min", "val") := list(min[1], sum(val)), by = a][, .SD[.N], by = a]

Benchmarks

Microbenchmarking the four solutions gave the following boxplot (n = 10):

[Boxplot of microbenchmark timings for the four solutions]
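For completeness, here is a minimal sketch of how such a benchmark might be set up with the microbenchmark package; the labels are placeholders, setkey(dt, a, c) is assumed to have been run first, and copy() wraps the last variant because := would otherwise modify dt by reference:

library(data.table)
library(microbenchmark)

setkey(dt, a, c)  # c last in the key so that .SD[.N] picks the max-c row per group

microbenchmark(
  original = dt[, c(.SD[.N], min = min[1], val = sum(val)), by = a][, -c(4, 6)],
  added    = dt[, c(.SD[.N, c(1:2, 4)], min = min[1], val = sum(val)), by = a],
  added2   = dt[, list(b = b[.N], c = c[.N], min = min[1], max = max[.N], val = sum(val)), by = a],
  added3   = copy(dt)[, c("min", "val") := list(min[1], sum(val)), by = a][, .SD[.N], by = a],
  times = 10  # matches n = 10 in the boxplot above
)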

Upvotes: 2
