user3226167

Reputation: 3439

How to write reusable functions for columns in by group operations in data.table?

I need the same set of columns (~20) in many data.tables. How do I encapsulate the operations that compute them in a function?

For example, I want columns a1 and a2 in every data.table. The fastest method is to copy and paste the code:

library(data.table)

n = 10
m = 2
d = data.table(p = c(1:n) * 1.0, q = 1:m)
dnew = d[, list(a1 = mean(p), a2 = max(p), b = 2), by = q] # copy and paste

I want to write reusable functions like this:

f <- function(d) with(d, list( a1 = mean(p), a2 = max(p))) #return list
dnew = d[, c(f(.SD), list( b = 2)) , by = q]

or this,

g <- function(d)d[, list(a1 = mean(p), a2 = max(p)), by = q] #return data.table
dnew1 = g(d)
dnew2 = d[, list(b = 2),by = q]
dnew = merge(dnew1, dnew2, by = "q")

However, both are very slow when the number of groups (m) is very large.

Upvotes: 1

Views: 86

Answers (1)

Frank

Reputation: 66819

Well, you can follow the metaprogramming help from FAQ 1.6:

# expression instead of a function
fe = quote(list(a1 = mean(p), a2 = max(p)))

# add another element
e = fe
e$b = 2

# eval following FAQ
d[, eval(e), by=q]

I borrowed the e$b = 2 syntax from Hadley Wickham's notes on expressions.
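To see what that assignment does to the expression, here is a minimal base-R sketch (no data.table needed). The plain list passed to eval() is only a stand-in for one group's columns, and res is a name I made up for illustration:

```r
# quote() stores the call without evaluating it
fe <- quote(list(a1 = mean(p), a2 = max(p)))

# assigning by name appends a new named argument to a copy of the call
e <- fe
e$b <- 2

print(e)   # shows the modified call; fe itself is unchanged
names(e)   # the first element is the function name `list`, so its name is ""

# evaluate against a plain list standing in for one group's data
res <- eval(e, list(p = c(1, 3, 5)))
str(res)
```

Since e is still just a call, data.table can inspect it when it is passed through eval() in j, which is what the FAQ approach relies on.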

This works, but the output of d[, eval(e), by=q, verbose=TRUE] shows that max is not getting optimized. Since b is just a constant, I'd add it in a second step:

extrae = quote(`:=`(b = 2))
d[, eval(fe), by=q][, eval(extrae)][]

# or if working interactively...
d[, eval(fe), by=q][, b := 2][]

With verbose=TRUE, we'll now see that fe is optimized to list(gmean(p), gmax(p)).
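If you want to reuse this across many tables, one option is to keep the quoted expression in one place and wrap the eval() call in a small helper. This is only a sketch: common_cols, summarise_common and by_cols are names I made up, and it assumes every table has a column p and that by_cols is a character vector of grouping column names.

```r
library(data.table)

# shared summary columns, stored unevaluated so data.table can inspect the call
common_cols <- quote(list(a1 = mean(p), a2 = max(p)))

# hypothetical helper: aggregate any table that has a column p,
# grouped by the columns named in by_cols
summarise_common <- function(dt, by_cols) {
  dt[, eval(common_cols), by = by_cols]
}

d <- data.table(p = as.numeric(1:10), q = 1:2)

# add the constant column in a second step, as above
dnew <- summarise_common(d, by_cols = "q")[, b := 2][]
dnew
```

I'd still check with verbose=TRUE inside the helper on your real data to confirm the optimization applies there too.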

Upvotes: 5
