Alex Holcombe

Reputation: 2593

Split-apply-combine with function that returns multiple variables

I need to apply myfun to subsets of a data frame and include the results as new columns in the returned data frame. I used to do this with ddply. In dplyr, I believe summarise is the tool for that, like this:

myfun <- function(x, y) {
  df <- data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
  return(df)
}

mtcars %>%
  group_by(cyl) %>%
  summarise(a = myfun(cyl,disp)$a, b = myfun(cyl,disp)$b)

The above code works, but the myfun I'll be using is computationally very expensive, so I want it to be called only once rather than separately for the a and b columns. Is there a way to do that in dplyr?

Upvotes: 1

Views: 863

Answers (3)

akrun

Reputation: 887078

We can use data.table:

library(data.table)
setDT(mtcars)[, myfun(cyl, disp), cyl] 
#    cyl         a         b
#1:   6 1099.8857 -177.3143
#2:   4  420.5455 -101.1364
#3:   8 2824.8000 -345.1000

Upvotes: 2

www

Reputation: 39154

Using do will not necessarily improve speed. In this post, I introduce another way to write a function that performs the same task, and then benchmark each method to compare performance.

Here is an alternative way to define the function.

myfun2 <- function(dt, x, y){
  x <- enquo(x)
  y <- enquo(y)

  dt2 <- dt %>%
    summarise(a = mean(!!x) * mean(!!y), b = mean(!!x) - mean(!!y))
  return(dt2)
}

Notice that the first argument of myfun2 is dt, the input data frame. This lets myfun2 be used directly as a step in a pipe.

mtcars %>%
  group_by(cyl) %>%
  myfun2(x = cyl, y = disp)
# A tibble: 3 x 3
    cyl         a         b
  <dbl>     <dbl>     <dbl>
1     4  420.5455 -101.1364
2     6 1099.8857 -177.3143
3     8 2824.8000 -345.1000

This way, we don't have to call myfun once for each new column we create, so this method is probably more efficient than myfun.

Here is a performance comparison using the microbenchmark package. The methods I compared are listed below. I ran each expression 1000 times.

m1: OP's original way to apply `myfun`  
m2: Psidom's method, using `do` to apply `myfun`  
m3: My approach, using `myfun2`  
m4: Using `do` to apply `myfun2`  
m5: Z.Lin's suggestion, directly calculating the values without defining a function  
m6: akrun's `data.table` approach with `myfun`

Here is the code for benchmarking.

library(dplyr)
library(data.table)
library(microbenchmark)

microbenchmark(m1 = (mtcars %>%
                       group_by(cyl) %>%
                       summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)),
               m2 = (mtcars %>% 
                       group_by(cyl) %>% 
                       do(myfun(.$cyl, .$disp))),
               m3 = (mtcars %>%
                       group_by(cyl) %>%
                       myfun2(x = cyl, y = disp)),
               m4 = (mtcars %>%
                       group_by(cyl) %>%
                       do(myfun2(., x = cyl, y = disp))),
               m5 = (mtcars %>% 
                       group_by(cyl) %>% 
                       summarise(a = mean(cyl) * mean(disp), b = mean(cyl) - mean(disp))),
               m6 = (as.data.table(mtcars)[, myfun(cyl, disp), cyl]),
               times = 1000)

And here is the result of benchmarking.

Unit: milliseconds
 expr       min        lq      mean    median        uq        max neval
   m1  7.058227  7.692654  9.429765  8.375190 10.570663  28.730059  1000
   m2  8.559296  9.381996 11.643645 10.500100 13.229285  27.585654  1000
   m3  6.817031  7.445683  9.423832  8.085241 10.415104 193.878337  1000
   m4 21.787298 23.995279 28.920262 26.922683 31.673820 177.004151  1000
   m5  5.337132  5.785528  7.120589  6.223339  7.810686  23.231274  1000
   m6  1.320812  1.540199  1.919222  1.640270  1.935352   7.622732  1000

The results show that the do-based methods (m2 and m4) are actually slower than their counterparts (m1 and m3), so in this situation applying myfun (m1) or myfun2 (m3) directly is faster than wrapping it in do. myfun2 (m3) is slightly faster than myfun (m1). Computing the values directly without defining any function (m5) is faster than all of the function-based methods (m1 to m4), which suggests that for this particular case there is no need to define a function. Finally, if there is no need to stay in the tidyverse, or the dataset is very large, we can consider the data.table approach (m6), which is much faster than all of the tidyverse solutions listed here.
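One more variant that is not part of the benchmark above: a minimal sketch, assuming dplyr 1.0.0 or later (newer than the versions discussed in this thread). In those versions, summarise accepts an unnamed expression that returns a data frame and unpacks it into columns, so the original myfun should only need to be evaluated once per group. The name res is just for illustration, and I have not timed this approach.

```r
library(dplyr)

# Same function as in the question: returns a one-row data frame.
myfun <- function(x, y) {
  data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
}

# Assumes dplyr >= 1.0.0: an unnamed expression that returns a data frame
# is unpacked into one output column per variable (here a and b),
# so there is a single myfun call per group instead of one per column.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(myfun(cyl, disp))

res
```

The appeal of this form is that myfun itself stays unchanged, which matters when the function is expensive and cannot easily be rewritten the way myfun2 was.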

Upvotes: 3

akuiper

Reputation: 214957

Since your function returns a data frame, you can call it within group_by %>% do, which applies the function to each group and rbinds the returned data frames together:

mtcars %>% group_by(cyl) %>% do(myfun(.$cyl, .$disp))

# A tibble: 3 x 3
# Groups:   cyl [3]
#    cyl         a         b
#  <dbl>     <dbl>     <dbl>
#1     4  420.5455 -101.1364
#2     6 1099.8857 -177.3143
#3     8 2824.8000 -345.1000
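To make the "apply per group, then rbind" idea concrete, here is a rough base R sketch of the same steps. It only illustrates the pattern and is not how do is implemented internally; the names pieces, results and combined are made up for this example.

```r
# Same function as in the question: returns a one-row data frame.
myfun <- function(x, y) {
  data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
}

# Split: one data frame per value of cyl.
pieces <- split(mtcars, mtcars$cyl)

# Apply: call myfun exactly once per piece and attach the group label.
results <- lapply(pieces, function(d) {
  cbind(cyl = d$cyl[1], myfun(d$cyl, d$disp))
})

# Combine: stack the one-row results into a single data frame.
combined <- do.call(rbind, results)
rownames(combined) <- NULL
combined
```

This should give the same three rows of cyl, a and b as the do call above, with myfun evaluated once per group.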

Upvotes: 3
