Unexpected behavior in dplyr::group_by_ and dplyr::summarise_

Question

I wrote this little function to find the R-squared value of a regression performed on two variables in the mtcars data set, which is included in R by default:

get_r_squared = function(x) summary(lm(mpg ~ hp, data = x))$r.squared

It seems to work as expected when I give it the full data set:

get_r_squared(mtcars)
# [1] 0.6024373

However, when I use it in a dplyr pipeline on grouped data, it returns the full-data answer three times, whereas I expected a different value for each group.

library(dplyr)

mtcars %>% 
  group_by_("cyl") %>% 
  summarise_(r_squared = get_r_squared(.))

## Source: local data frame [3 x 2]
## 
##   cyl r_squared
## 1   4 0.6024373
## 2   6 0.6024373
## 3   8 0.6024373

I was expecting the values to look like this instead:

sapply(
  unique(mtcars$cyl),
  function(cyl){
    get_r_squared(mtcars[mtcars$cyl == cyl, ])
  }
)
# [1] 0.01614624 0.27405583 0.08044919
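One thing to keep in mind when comparing these numbers with the grouped output: unique(mtcars$cyl) returns the groups in order of first appearance (6, 4, 8), not sorted, and the result above is unnamed. A minimal base R sketch that labels each value with its group, assuming the same one-argument get_r_squared as defined above:

```r
# Same helper as in the question: R-squared of mpg ~ hp on a data frame
get_r_squared <- function(x) summary(lm(mpg ~ hp, data = x))$r.squared

# split() makes one data frame per cyl value (sorted by level: 4, 6, 8);
# sapply() keeps the group labels as names on the result
r2_by_cyl <- sapply(split(mtcars, mtcars$cyl), get_r_squared)
r2_by_cyl
```

The names on the result make it easy to line the values up against the cyl column of a grouped summary.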

I've confirmed that this is not a plyr namespace issue: that package is not loaded.

search() 

##  [1] ".GlobalEnv"        "package:knitr"     "package:dplyr"    
##  [4] "tools:rstudio"     "package:stats"     "package:graphics" 
##  [7] "package:grDevices" "package:utils"     "package:datasets" 
## [10] "package:methods"   "Autoloads"         "package:base"

I'm not sure what's going on here. Could it be related to nonstandard evaluation in the lm function? Or am I just misunderstanding how group_by works? Or perhaps something else?

hadley · Accepted Answer

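I think you've misunderstood how summarise() works: it doesn't do anything special with ., which inside a pipe is simply the whole data frame on the left-hand side of %>%. That is why every group gets the R-squared of the full data set, and the fact that it works at all is just happy chance. Instead, pass the columns themselves, which summarise() evaluates per group. Try something like this: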

library(dplyr)
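# Take two vectors rather than a data frame, so summarise() can supply
# each group's columns; lm(x ~ y) regresses the first on the second
get_r_squared <- function(x, y) summary(lm(x ~ y))$r.squared
mtcars %>% 
  group_by(cyl) %>% 
  summarise(r_squared = get_r_squared(mpg, hp))  # hp, as in the question's mpg ~ hp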
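If you would rather keep the original one-argument function that takes a data frame, one option is dplyr's do(), where . does refer to the current group's rows. This is a minimal sketch rather than the approach recommended above; it assumes the question's definition of get_r_squared, and do() is marked as superseded in recent dplyr releases, although it is still available:

```r
library(dplyr)

# One-argument version from the question: expects a data frame
get_r_squared <- function(x) summary(lm(mpg ~ hp, data = x))$r.squared

# Inside do(), `.` is the data frame for the current group, so the
# regression is refit once per cyl value; wrapping the result in
# data.frame() gives do() a one-row table to bind to the group label
r2_by_group <- mtcars %>%
  group_by(cyl) %>%
  do(data.frame(r_squared = get_r_squared(.)))

r2_by_group
```

The result has one row per cyl value with its own r_squared, rather than the full-data value repeated three times.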
