Alexey Ferapontov

Reputation: 5169

Usage of multiple output function with ddply

I have a function that returns more than one value. I need to use it in ddply but I want to avoid calling the function multiple times. Here's a mock-up example:

library(plyr)

ff = function(i) {
  return(c(min(i),max(i)))
}

set.seed(12345)
id = c(rep(1:3,4))
x  = sample(1:10, 12, replace=TRUE)
df = data.frame(id,x)

res = ddply(df,.(id),summarise,val1 = min(x), val2 = max(x), val3 = ff(x)[1], val4 = ff(x)[2])
View(res)


  id val1 val2 val3 val4
1  1    4   10    4   10
2  2    1    9    1    9
3  3    2    8    2    8

As expected, val3 = val1 and val4 = val2. But I have to call ff twice inside ddply, which wastes time. Is there a way to fill both columns from a single call to ff within ddply? If I try to use [1:2] or similar, I get an error: Error in eval(expr, envir, enclos) : length(rows) == 1 is not TRUE

Thanks!

Edit. Thanks to all contributors! David's solution ran about 2 times faster, and it allows further operations on the intermediate results. Here is the updated, fully reproducible code.

library(plyr)
library(data.table)
library(microbenchmark)

ff = function(i) {
  return(c(min(i),max(i)))
}

set.seed(12345)
id = c(rep(1:3,4000))
x  = runif(12000,1,10)
df = data.frame(id,x)
View(df)

res  = ddply(df,.(id),summarise,val1 = min(x), val2 = max(x), val3 = ff(x)[1], val4 = ff(x)[2], val5 = val3+val4, val6 = val3/val4)
View(res)

res2 = setDT(df)[, as.list(c(val1 = min(x), val2 = max(x), val3 = ff(x))), .(id)][, val5 := val31+val32][, val6 := val31/val32]
View(res2)

print(microbenchmark(ddply(df,.(id),summarise,val1 = min(x), val2 = max(x), val3 = ff(x)[1], val4 = ff(x)[2], val5 = val3+val4, val6 = val3/val4), times = 100))
print(microbenchmark(setDT(df)[, as.list(c(val1 = min(x), val2 = max(x), val3 = ff(x))), .(id)][, val5 := val31+val32][, val6 := val31/val32],times=100))

Results:

Unit: milliseconds
                                                                                                                                   expr
 ddply(df, .(id), summarise, val1 = min(x), val2 = max(x), val3 = ff(x)[1],      val4 = ff(x)[2], val5 = val3 + val4, val6 = val3/val4)
      min       lq     mean   median       uq     max neval
 3.042616 3.185358 5.976851 3.409828 3.925104 45.5157   100
Unit: milliseconds
                                                                                                                                    expr
 setDT(df)[, as.list(c(val1 = min(x), val2 = max(x), val3 = ff(x))),      .(id)][, `:=`(val5, val31 + val32)][, `:=`(val6, val31/val32)]
      min       lq     mean   median       uq      max neval
 1.968349 2.071747 2.285368 2.124206 2.251171 12.62967   100

Upvotes: 3

Views: 1551

Answers (1)

IRTFM

Reputation: 263332

If you construct your function to return a named vector, then data.table will accept it and populate the columns with those names, returning the desired structure:

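library(data.table)

# Return a named vector so that the names become column names
ff = function(i) {
  return(c(val3 = min(i), val4 = max(i)))
}
# ff() is called once per group; as.list() spreads the vector over columns
setDT(df)[, as.list(c(var1 = min(x), var2 = max(x), ff(x))), id]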
#-----------
   id var1 var2 val3 val4
1:  1    4   10    4   10
2:  2    1    9    1    9
3:  3    2    8    2    8
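
If you would rather stay within plyr, one possible approach is to hand ddply an anonymous function instead of summarise, so that ff runs once per group and its two elements are split into columns afterwards. This is only a sketch under the question's mock-up data: the names d and r are illustrative, and it was not part of the benchmark in the question.

```r
library(plyr)

# Same helper as in the question: returns c(min, max) of its input.
ff <- function(i) c(min(i), max(i))

set.seed(12345)
df <- data.frame(id = rep(1:3, 4), x = sample(1:10, 12, replace = TRUE))

# d is the sub-data-frame for one id; ff() is evaluated once per group
# and the stored result r is reused for every derived column.
res <- ddply(df, .(id), function(d) {
  r <- ff(d$x)
  data.frame(val3 = r[1], val4 = r[2], val5 = r[1] + r[2])
})
print(res)
```

Because r is held in a local variable, further derived columns (such as val5 here) can reuse it without another call to ff.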

Upvotes: 1
