Eric Fail
Eric Fail

Reputation: 7958

dplyr and aggregation with summarise; a simple way to get mean at diffrent levels of aggregation

I am interested in the total mean, and the mean within different conditions, of some measurements preferably using dplyr's summarise function.

I'll illustrate my question in the following. Say I have some data, borrowed form this this,

dta <- read.table(header=TRUE, text='
 subject sex condition measurement
       1   M   control         7.9
       1   M     cond1        12.3
       1   M     cond2        10.7
       2   F   control         6.3
       2   F     cond1        10.6
       2   F     cond2        11.1
       3   F   control         9.5
       3   F     cond1        13.1
       3   F     cond2        13.8
       4   M   control        11.5
       4   M     cond1        13.4
       4   M     cond2        12.9
') # ; dta

I now want the mean for each sex and the mean by sex for each condition. I know how to get it for each condition, like this.

# install.packages(c("dplyr"), dependencies = TRUE)
library(dplyr) 
dta %>% 
  group_by(sex, condition) %>%
  summarise(
    mean = mean(measurement)
)
#> # A tibble: 6 x 3
#> # Groups:   sex [?]
#>      sex condition  mean
#>   <fctr>    <fctr> <dbl>
#> 1      F     cond1 11.85
#> 2      F     cond2 12.45
#> 3      F   control  7.90
#> 4      M     cond1 12.85
#> 5      M     cond2 11.80
#> 6      M   control  9.70

But, this does not give me the aggregate mean for both sexes. To get this I either have to run a separate call, i.e.

dta %>% 
  group_by(sex) %>%
  summarise(
    mean = mean(measurement)
)
#> # A tibble: 2 x 2
#>      sex     mean
#>   <fctr>    <dbl>
#> 1      F 10.73333
#> 2      M 11.45000

or deconstruct data structure. Like this,

# install.packages(c("tidyr"), dependencies = TRUE)
library(tidyr)
dta_wide <- spread(dta, condition, measurement)

dta_wide %>% 
  group_by(sex) %>%
  summarise(
    mean_tot = mean(cond1 + cond2 + control)/3,
    mean_cond1 = mean(cond1),
    mean_cond2 = mean(cond2),
    mean_control = mean(control)        
)
#> # A tibble: 2 x 5
#>      sex mean_tot mean_cond1 mean_cond2 mean_control
#>   <fctr>    <dbl>      <dbl>      <dbl>        <dbl>
#> 1      F 10.73333      11.85      12.45          7.9
#> 2      M 11.45000      12.85      11.80          9.7

This gives me an output with both the over all mean by sex and the individual mean by condition.

However, both running two separate calls and deconstructing data seems unnecessarily cumbersome. Isn't there a simply way to add a categorical variable, here condition, as the by variable and at the same time keep the aggregate information, here mean by sex? Maybe I am overlooking something logical and shouldn't be messing with data like this?

Upvotes: 2

Views: 1204

Answers (1)

akuiper
akuiper

Reputation: 215137

One option is to calculate the two summaries separately, then join back:

dta %>% 
    group_by(sex, condition) %>% 
    summarise(mean = mean(measurement)) %>% 
    inner_join(
        group_by(dta, sex) %>% 
        summarise(mean_tot = mean(measurement))
    )

# Joining, by = "sex"
# A tibble: 6 x 4
# Groups:   sex [?]
#     sex condition  mean mean_tot
#  <fctr>    <fctr> <dbl>    <dbl>
#1      F     cond1 11.85 10.73333
#2      F     cond2 12.45 10.73333
#3      F   control  7.90 10.73333
#4      M     cond1 12.85 11.45000
#5      M     cond2 11.80 11.45000
#6      M   control  9.70 11.45000

Or use group_by twice:

dta %>% 
    group_by(sex, condition) %>% 
    summarise(s = sum(measurement), n = n()) %>% 
    group_by(sex) %>% 
    transmute(condition, mean_tot = sum(s) / sum(n), mean = s / n)

# Adding missing grouping variables: `sex`
# A tibble: 6 x 4
# Groups:   sex [2]
#     sex condition mean_tot  mean
#  <fctr>    <fctr>    <dbl> <dbl>
#1      F     cond1 10.73333 11.85
#2      F     cond2 10.73333 12.45
#3      F   control 10.73333  7.90
#4      M     cond1 11.45000 12.85
#5      M     cond2 11.45000 11.80
#6      M   control 11.45000  9.70

Upvotes: 2

Related Questions