Nested group_by operation in dplyr: does the second call include the first call?

Question

In my data below, I first want to group_by(study), compute the mean of X for each unique study value, and subtract that mean from each X value in the study.

Second, while group_by(study) is still in effect, I want to further group_by(outcome) within each study, compute the mean of X for each unique outcome value within each study, and subtract that mean from each X value in that outcome within that study.

I'm using the following workaround, but it doesn't seem to achieve my goal, because the group_by(outcome) call appears to ignore the previous group_by(study).

Is there a way to achieve what I described above?

library(dplyr)

set.seed(0)
(data <- expand.grid(study = 1:2, outcome = rep(1:2,2)))
data$X <- rnorm(nrow(data))
(data <- arrange(data,study))

#  study outcome          X
#1     1       1  1.2629543
#2     1       2  1.3297993
#3     1       1  0.4146414
#4     1       2 -0.9285670
#5     2       1 -0.3262334
#6     2       2  1.2724293
#7     2       1 -1.5399500
#8     2       2 -0.2947204


data %>% 
  group_by(study) %>% 
  mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
  group_by(outcome) %>%
  mutate(X_between_ou = mean(X), X_within_ou = X-X_between_ou)

Ronak Shah · Accepted Answer

Yes, by default the second group_by() overrides the previous one, so the grouping by study is dropped. You can check this with the group_vars() function.

library(dplyr)

data %>% 
  group_by(study) %>% 
  mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
  group_by(outcome) %>%
  group_vars()

#[1] "outcome"

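As you can see, at this stage the data is grouped only by outcome.

You can achieve your goal by passing .add = TRUE to group_by(), which adds the new variable to the existing groups instead of replacing them. (In dplyr versions before 1.0.0 this argument was named add.)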

data %>% 
  group_by(study) %>% 
  mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
  group_by(outcome, .add = TRUE) %>%
  group_vars()

#[1] "study"   "outcome"
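The difference between the two behaviours can also be seen by counting groups. The following is a minimal sketch on a made-up data frame (toy and its values are illustrative, not the question's data); it also shows that naming both variables in a single group_by() call should give the same grouping as .add = TRUE.

```r
library(dplyr)

# Hypothetical toy data: column names mirror the question, values are made up.
toy <- data.frame(
  study   = c(1, 1, 2, 2),
  outcome = c(1, 2, 1, 2),
  X       = c(10, 20, 30, 50)
)

# Default behaviour: the second group_by() replaces the first.
replaced <- toy %>% group_by(study) %>% group_by(outcome)

# .add = TRUE: outcome is nested inside the existing study grouping.
nested <- toy %>% group_by(study) %>% group_by(outcome, .add = TRUE)

# Naming both variables up front yields the same grouping variables.
direct <- toy %>% group_by(study, outcome)

group_vars(replaced)  # "outcome"
n_groups(replaced)    # 2 groups: one per outcome, pooled across studies
group_vars(nested)    # "study" "outcome"
n_groups(nested)      # 4 groups: one per study/outcome combination
identical(group_vars(nested), group_vars(direct))
```

The .add = TRUE form is the one that fits the question, because the study-level columns have to be computed before the finer grouping is applied.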

So the final code becomes:

data %>% 
  group_by(study) %>% 
  mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
  group_by(outcome, .add = TRUE) %>%
  mutate(X_between_ou = mean(X), X_within_ou = X-X_between_ou)

#  study outcome      X X_between_st X_within_st X_between_ou X_within_ou
#  <int>   <int>  <dbl>        <dbl>       <dbl>        <dbl>       <dbl>
#1     1       1  1.26         0.520      0.743         0.839       0.424
#2     1       2  1.33         0.520      0.810         0.201       1.13 
#3     1       1  0.415        0.520     -0.105         0.839      -0.424
#4     1       2 -0.929        0.520     -1.45          0.201      -1.13 
#5     2       1 -0.326       -0.222     -0.104        -0.933       0.607
#6     2       2  1.27        -0.222      1.49          0.489       0.784
#7     2       1 -1.54        -0.222     -1.32         -0.933      -0.607
#8     2       2 -0.295       -0.222     -0.0726        0.489      -0.784
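As an alternative sketch, and assuming dplyr 1.1.0 or later is installed, the same columns can be computed with the per-operation .by argument of mutate(), which avoids persistent grouping altogether. This is not part of the answer above; the name res is only illustrative, and the data is rebuilt from the question so the block runs on its own.

```r
library(dplyr)

# Rebuild the question's data so this block is self-contained.
set.seed(0)
data <- expand.grid(study = 1:2, outcome = rep(1:2, 2))
data$X <- rnorm(nrow(data))
data <- arrange(data, study)

# .by groups for a single mutate() call only (assumes dplyr >= 1.1.0),
# so there is no earlier grouping for a later step to override.
res <- data %>%
  mutate(X_between_st = mean(X),
         X_within_st  = X - X_between_st,
         .by = study) %>%
  mutate(X_between_ou = mean(X),
         X_within_ou  = X - X_between_ou,
         .by = c(study, outcome))

res
```

Because .by applies to one call at a time, the result comes back ungrouped and no ungroup() is needed afterwards.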
