Let's say I have a data.table looking like this:
library(data.table)

dt <- data.table(
  a   = c("A", "B", "C", "C"),
  b   = c("U", "V", "W", "X"),
  c   = c(0.1, 0.2, 0.3, 0.4),
  min = c(0, 1, 2, 3),
  max = c(11, 12, 13, 14),
  val = c(100, 200, 300, 400),
  key = "a"
)
My actual data.table has many more columns and up to a couple of million rows. About 10% of the rows have a duplicated key a. I'd like to aggregate those rows with a function like this one:
comb <- function(x) {
  k <- which.max(x[, c])
  list(b = x[k, b], c = x[k, c], min = min(x[, min]),
       max = max(x[, max]), val = sum(x[, val]))
}
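For the toy table above, the result I'm after looks like this (worked out by hand from comb, not actual console output):

   a b   c min max val
1: A U 0.1   0  11 100
2: B V 0.2   1  12 200
3: C X 0.4   2  14 700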
However, calling
dt <- dt[, comb(.SD), by = a]
is very slow and I'm wondering how I could improve this. Any help is appreciated.
By placing c in the key, the rows within each group of a are sorted by c, so .N picks out the row with the largest c and we can avoid which.max (untested):
setkey(dt, a, c)  # within each a the rows are now ordered by c, so .N is the row with the largest c
dt[, c(.SD[.N], min = min[1], val = sum(val)), by = a][, -c(4, 6)]  # drop the min and val columns that came from .SD[.N]
ADDED: or this variation, which keeps only b, c and max from the last row so no columns have to be dropped afterwards:
dt[, c(.SD[.N, c(1:2, 4)], min = min[1], val = sum(val)), by = a]
ADDED 2: We only used .SD because you indicated you had many columns, but if you are willing to write them out then the above could be written:
dt[, list(b = b[.N], c = c[.N], min = min[1], max = max[.N], val = sum(val)), by = a]
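As a quick sanity check on the toy table this can be compared against the original comb; a minimal sketch, assuming the key (a, c) set above (the res_* names are arbitrary):

res_sd   <- dt[, comb(.SD), by = a]
res_long <- dt[, list(b = b[.N], c = c[.N], min = min[1], max = max[.N], val = sum(val)), by = a]
all.equal(res_sd, res_long, check.attributes = FALSE)  # should be TRUE on this sample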
ADDED 3: Yet another variation, which first overwrites min and val within each group by reference and then keeps the last row of each group:
dt[, c("min", "val") := list(min[1], sum(val)), by = a][, .SD[.N], by = a]
Microbenchmarking the four solutions gave the following boxplot (n = 10):
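A benchmark of this kind could be set up with the microbenchmark package; a sketch, where the sol names and the copy() calls are assumptions rather than the exact script used (each expression runs on a fresh keyed copy so the := in the last variation cannot affect the others):

library(microbenchmark)

bench <- microbenchmark(
  sol1 = copy(dt)[, c(.SD[.N], min = min[1], val = sum(val)), by = a][, -c(4, 6)],
  sol2 = copy(dt)[, c(.SD[.N, c(1:2, 4)], min = min[1], val = sum(val)), by = a],
  sol3 = copy(dt)[, list(b = b[.N], c = c[.N], min = min[1], max = max[.N], val = sum(val)), by = a],
  sol4 = copy(dt)[, c("min", "val") := list(min[1], sum(val)), by = a][, .SD[.N], by = a],
  times = 10
)
boxplot(bench)  # distribution of the 10 timings per solution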