Lin Jing

Reputation: 159

Why does the ``mean`` function not work properly with ``group_by %>% summarise`` in a function environment?

For example:

df <- data.frame("Treatment" = c(rep("A", 2), rep("B", 2)), "Price" = 1:4, "Cost" = 2:5)

I want to summarize the data by treatments for all the variables I have, and put them together, so I define a function to do this for each variable first, and then rbind them later on.

library(dplyr)

SummarizeFn <- function(x, y, z) {
  df1 <- x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean(y), SD = sd(y))
  df1$Var <- z  # add a column to show which variable those statistics belong to
  df1           # return the summary table
}
SumPrice <- SummarizeFn(df, df$Price, "Price")

However, the results are:

  Treatment     n  Mean    SD Var  
  <fct>     <int> <dbl> <dbl> <chr>
1 A             2   2.5  1.29 Price
2 B             2   2.5  1.29 Price

These are the mean and sd of all the observations, not of the observations within each Treatment group. What is the problem here?

If I take the code out of the function environment, it works totally fine. Please help, thanks.

If you have a better way to achieve my purpose, that would be great! Thanks!

Upvotes: 1

Views: 672

Answers (2)

linog

Reputation: 6226

This comes down to standard versus non-standard evaluation. That's funny, I just wrote an article on the subject. It is quite hard to pass column names as strings to dplyr. If you need to do that, use rlang::sym (or rlang::syms) to turn the strings into symbols and !! (or !!!) to unquote them.
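
As a minimal sketch of that approach, assuming the df from the question (summarize_by_name and var_name are hypothetical names used only for illustration):

```r
library(dplyr)
library(rlang)

df <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)), Price = 1:4, Cost = 2:5)

# Hypothetical helper: `var_name` is a column name given as a string
summarize_by_name <- function(data, var_name) {
  col <- sym(var_name)  # convert the string into a symbol
  data %>%
    group_by(Treatment) %>%
    # !! unquotes the symbol so it is looked up within each group
    summarize(n = n(), Mean = mean(!!col), SD = sd(!!col), Var = var_name)
}

summarize_by_name(df, "Price")
```

Because the column is resolved inside summarize, the statistics are computed per Treatment group rather than on the whole column.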

Regarding your problem, I think data.table offers a concise solution:

library(data.table)

dt <- as.data.table(mtcars)
output <- dt[,lapply(.SD, function(d) return(list(.N,mean(d),sd(d)))),
   .SDcols = c("mpg","qsec")]
output[,'stat' := c("observations","mean","sd")]
output

# output
#         mpg     qsec         stat
# 1:       32       32 observations
# 2: 20.09062 17.84875         mean
# 3: 6.026948 1.786943           sd

I propose an anonymous function with lapply, but you could use a more sophisticated function defined before the summary step. Change .SDcols to include more variables if needed.
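
The example above does not group by anything. A possible adaptation to the data in the question, as a sketch only: reshape to long format with melt and then group by Treatment and variable (the names long, value and res are illustrative assumptions, not part of the original code).

```r
library(data.table)

df <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)), Price = 1:4, Cost = 2:5)
dt <- as.data.table(df)

# Long format: one row per (Treatment, variable, value)
long <- melt(dt, id.vars = "Treatment", variable.name = "Var", value.name = "value")

# Count, mean and sd for every Treatment/variable combination
res <- long[, .(n = .N, Mean = mean(value), SD = sd(value)), by = .(Treatment, Var)]
res
```

This gives one row per treatment and variable, so no separate rbind step is needed.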

Upvotes: 1

Ronak Shah

Reputation: 388982

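When you pass a column with $ (for example df$Price) inside a dplyr pipe, it does not respect the grouping: the expression returns the whole column of the original data frame, so mean and sd are computed over all rows for every group. Instead, pass the bare column name and wrap it in {{ }} inside the function, so that dplyr evaluates it within each group.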

library(dplyr)

SummarizeFn <- function(x,y,z) {
  x %>% 
    group_by(Treatment) %>% 
    summarize(n = n(), Mean = mean({{y}}), SD = sd({{y}}), Var = z)
}

SummarizeFn(df, Price, "Price")

#  Treatment     n  Mean    SD Var  
#  <fct>     <int> <dbl> <dbl> <chr>
#1 A             2   1.5 0.707 Price
#2 B             2   3.5 0.707 Price
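
To put the per-variable summaries together, as the question asks, the results can be stacked with bind_rows. A minimal sketch, assuming the df from the question (AllSummaries is just an illustrative name):

```r
library(dplyr)

df <- data.frame(Treatment = c(rep("A", 2), rep("B", 2)), Price = 1:4, Cost = 2:5)

# Same function as above, repeated so the snippet runs on its own
SummarizeFn <- function(x, y, z) {
  x %>%
    group_by(Treatment) %>%
    summarize(n = n(), Mean = mean({{ y }}), SD = sd({{ y }}), Var = z)
}

# One summary per variable, stacked into a single table
AllSummaries <- bind_rows(
  SummarizeFn(df, Price, "Price"),
  SummarizeFn(df, Cost, "Cost")
)
AllSummaries
```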

Upvotes: 1
