Reputation: 2593
I need to apply `myfun` to subsets of a data frame and include the results as new columns in the data frame returned. In the old days, I used `ddply`. But in `dplyr`, I believe `summarise` is used for that, like this:
myfun <- function(x, y) {
  df <- data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
  return(df)
}

mtcars %>%
  group_by(cyl) %>%
  summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)
The above code works, but the `myfun` I'll be using is computationally very expensive, so I want it to be called only once rather than separately for the `a` and `b` columns. Is there a way to do that in `dplyr`?
Upvotes: 1
Views: 863
Reputation: 887078
We can use data.table
library(data.table)
setDT(mtcars)[, myfun(cyl, disp), cyl]
# cyl a b
#1: 6 1099.8857 -177.3143
#2: 4 420.5455 -101.1364
#3: 8 2824.8000 -345.1000
Upvotes: 2
Reputation: 39154
`do` is not necessarily going to improve the speed. In this post, I introduce another way to write a function that performs the same task, and then benchmark each method to compare performance.
Here is an alternative way to define the function.
myfun2 <- function(dt, x, y) {
  x <- enquo(x)
  y <- enquo(y)
  dt2 <- dt %>%
    summarise(a = mean(!!x) * mean(!!y), b = mean(!!x) - mean(!!y))
  return(dt2)
}
Notice that the first argument of `myfun2` is `dt`, which is the input data frame. Because of this, `myfun2` can be used directly as a step in a pipe.
mtcars %>%
  group_by(cyl) %>%
  myfun2(x = cyl, y = disp)
# A tibble: 3 x 3
cyl a b
<dbl> <dbl> <dbl>
1 4 420.5455 -101.1364
2 6 1099.8857 -177.3143
3 8 2824.8000 -345.1000
By doing this, we don't have to call `myfun` once for each new column we want to create. So this method is probably more efficient than calling `myfun` separately for `a` and `b`.
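As a side note, the same pipe-friendly function can probably be written more compactly with the embrace operator `{{ }}`. This is only a sketch, assuming `dplyr` with `rlang` 0.4.0 or later is installed; the name `myfun3` is hypothetical and is not one of the methods benchmarked in this post.

```r
library(dplyr)

# Hypothetical variant of myfun2: `{{ x }}` captures and unquotes the
# column in one step, standing in for the enquo() + !! pair.
myfun3 <- function(dt, x, y) {
  dt %>%
    summarise(a = mean({{ x }}) * mean({{ y }}),
              b = mean({{ x }}) - mean({{ y }}))
}

res <- mtcars %>%
  group_by(cyl) %>%
  myfun3(x = cyl, y = disp)

print(res)
```

It is called the same way as `myfun2`, so it should slot into the same place in a pipe.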
Here is a comparison of performance using the `microbenchmark` package. The methods I compared are listed below. I ran each expression 1000 times.
m1: OP's original way to apply `myfun`
m2: Psidom's method, using `do` to apply `myfun`.
m3: My approach, using `myfun2`
m4: Using `do` to apply `myfun2`
m5: Z.Lin's suggestion, directly calculating the values without defining a function.
m6: akrun's `data.table` approach with `myfun`
Here is the code for benchmarking.
library(dplyr)
library(data.table)
library(microbenchmark)

# myfun and myfun2 are the functions defined earlier in this thread
microbenchmark(
  m1 = (mtcars %>%
          group_by(cyl) %>%
          summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)),
  m2 = (mtcars %>%
          group_by(cyl) %>%
          do(myfun(.$cyl, .$disp))),
  m3 = (mtcars %>%
          group_by(cyl) %>%
          myfun2(x = cyl, y = disp)),
  m4 = (mtcars %>%
          group_by(cyl) %>%
          do(myfun2(., x = cyl, y = disp))),
  m5 = (mtcars %>%
          group_by(cyl) %>%
          summarise(a = mean(cyl) * mean(disp), b = mean(cyl) - mean(disp))),
  m6 = (as.data.table(mtcars)[, myfun(cyl, disp), cyl]),
  times = 1000)
And here is the result of benchmarking.
Unit: milliseconds
expr min lq mean median uq max neval
m1 7.058227 7.692654 9.429765 8.375190 10.570663 28.730059 1000
m2 8.559296 9.381996 11.643645 10.500100 13.229285 27.585654 1000
m3 6.817031 7.445683 9.423832 8.085241 10.415104 193.878337 1000
m4 21.787298 23.995279 28.920262 26.922683 31.673820 177.004151 1000
m5 5.337132 5.785528 7.120589 6.223339 7.810686 23.231274 1000
m6 1.320812 1.540199 1.919222 1.640270 1.935352 7.622732 1000
The results show that the `do` methods (`m2` and `m4`) are actually slower than their counterparts (`m1` and `m3`). In this situation, applying `myfun` (`m1`) or `myfun2` (`m3`) directly is faster than using `do`. `myfun2` (`m3`) is slightly faster than `myfun` (`m1`). However, calculating the values directly without defining any function (`m5`) is faster than all of the function-based methods (`m1` to `m4`), suggesting that in this particular case there is no need to define a function. Finally, if there is no need to stay in the `tidyverse`, or if the dataset is enormous, we can consider the `data.table` approach (`m6`), which is a lot faster than all the `tidyverse` solutions listed here.
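One more option that was not part of the benchmark above: assuming `dplyr` 1.0.0 or later is installed (which may not match the version timed here), `summarise()` accepts an unnamed expression that returns a data frame and spreads its columns into the result. A sketch under that assumption, which calls `myfun` only once per group:

```r
library(dplyr)

# Same myfun as in the question: returns a one-row data frame
myfun <- function(x, y) {
  data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
}

# The unnamed data-frame result is unpacked into columns a and b,
# so there is a single myfun() call per group.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(myfun(cyl, disp))

print(res)
```

How this compares in speed to `m1` through `m6` has not been measured here.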
Upvotes: 3
Reputation: 214957
Since your function returns a data frame, you can call it within `group_by %>% do`, which applies the function to each individual group and row-binds the returned data frames together:
mtcars %>% group_by(cyl) %>% do(myfun(.$cyl, .$disp))
# A tibble: 3 x 3
# Groups: cyl [3]
# cyl a b
# <dbl> <dbl> <dbl>
#1 4 420.5455 -101.1364
#2 6 1099.8857 -177.3143
#3 8 2824.8000 -345.1000
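To make the "apply per group, then rbind" idea concrete, here is a rough base R sketch of the same pattern. It is not how `do` is implemented internally; it only mimics the behaviour. The `calls` counter and the `counted_myfun` wrapper are hypothetical helpers added to show that `myfun` runs once per group.

```r
# Same myfun as in the question
myfun <- function(x, y) {
  data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
}

# Hypothetical wrapper that counts how often myfun is invoked
calls <- 0L
counted_myfun <- function(x, y) {
  calls <<- calls + 1L
  myfun(x, y)
}

# Split into one data frame per cyl value, apply once per piece,
# then row-bind the one-row results back together.
groups <- split(mtcars, mtcars$cyl)
pieces <- lapply(groups, function(g) {
  cbind(cyl = g$cyl[1], counted_myfun(g$cyl, g$disp))
})
result <- do.call(rbind, pieces)

print(result)
print(calls)  # one call per group
```

With the original `summarise(a = myfun(...)$a, b = myfun(...)$b)` form, the same counter would be incremented twice per group, which is the cost the question wants to avoid.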
Upvotes: 3