Reputation: 88
I would like to calculate summaries for different groups AND simultaneously calculate a summary for the overall (ungrouped) dataset, preferably using dplyr (or something that fits well into a dplyr pipeline).
The desired result can be achieved by separately calculating group summaries, then the overall summary, and then joining the results. However this seems a bit inefficient I am hoping there is a simpler solution that requires less duplication of code. I didn't find anything that related to this in the documentation or from other questions.
Reproducible data:
library(tidyverse)
set.seed(500)
dat <-
data_frame(treatment = sample(c("Group1", "Group2", "Group3"), 100, replace = TRUE),
recruitment_strategy = sample(c("Strategy 1", "Strategy 2", "Strategy 3", "Strategy 4", "Strategy 5"), 100, replace = TRUE),
Variable_A = rnorm(100),
Variable_B = rnorm(100),
Variable_C = rnorm(100))
Code to calculate means of several variables by group AND the means from the overall dataset:
count_by_group <- dat %>%
group_by(treatment) %>%
count(recruitment_strategy) %>%
mutate(`n (%)` = paste0(n, " (", round(n / sum(n)*100,0), "%)")) %>%
select(-n) %>%
spread(treatment, `n (%)`)
count_overall <- dat %>%
count(recruitment_strategy) %>%
mutate(`n (%)` = paste0(n, " (", round(n / sum(n)*100,0), "%)")) %>%
select(-n) %>%
rename(Overall_dataset = `n (%)`)
left_join(count_by_group, count_overall)
The desired output is achieved with the above code: a table of means for each group, next to overall means:
variable Group1 Group2 Group3 Overall_dataset
<chr> <dbl> <dbl> <dbl> <dbl>
1 Variable_A -0.154 0.0385 0.263 0.0351
2 Variable_B 0.212 -0.232 -0.124 -0.0671
3 Variable_C -0.195 0.194 0.0508 0.0376
A similar process on a categorical varaible to get counts and percentages for each group, and for the overall dataset:
count_by_group <- dat %>%
group_by(treatment) %>%
count(recruitment_strategy) %>%
mutate(`n (%)` = paste0(n, " (", round(n / sum(n)*100,0), "%)")) %>% # calculate percentage in the desired format for table
select(-n) %>%
spread(treatment, `n (%)`)
count_overall <- dat %>%
count(recruitment_strategy) %>%
mutate(`n (%)` = paste0(n, " (", round(n / sum(n)*100,0), "%)")) %>% # calculate percentage in the desired format for table
select(-n) %>%
rename(Overall_dataset = `n (%)`)
left_join(count_by_group, count_overall)
recruitment_strategy Group1 Group2 Group3 Overall_dataset
<chr> <chr> <chr> <chr> <chr>
1 Strategy 1 2 (6%) 13 (30%) 4 (16%) 19 (19%)
2 Strategy 2 8 (26%) 6 (14%) 6 (24%) 20 (20%)
3 Strategy 3 6 (19%) 12 (27%) 3 (12%) 21 (21%)
4 Strategy 4 9 (29%) 4 (9%) 5 (20%) 18 (18%)
5 Strategy 5 6 (19%) 9 (20%) 7 (28%) 22 (22%)
Is there a solution that can get a grouped summary AND an overall summary in a single step, rather than requiring the assignment of two separate objects that are then joined into a third object?
Upvotes: 2
Views: 727
Reputation: 15052
Here is how I would rewrite your code.
There is a trick with pipes to use the .
to put the LHS in multiple places on the RHS. That lets you do the join without needing to assign the intermediate objects. I also used a few more steps for a different balance of clarity vs not repeating myself, such as doing all the grouping inside count()
and using its name
argument, using mutate_at
to do all the formatting after the join, and using str_glue
and scales::percent
to make the string formatting a little more readable.
All of this is a matter of preference to some degree, but I think avoiding the intermediate assignments (and the burden of having to name said objects) is solved by the below approach.
library(tidyverse)
set.seed(500)
dat <- tibble(
treatment = sample(c("Group1", "Group2", "Group3"), 100, replace = TRUE),
recruitment_strategy = sample(c("Strategy 1", "Strategy 2", "Strategy 3", "Strategy 4", "Strategy 5"), 100, replace = TRUE),
Variable_A = rnorm(100),
Variable_B = rnorm(100),
Variable_C = rnorm(100)
)
dat %>%
inner_join(
x = count(., treatment, recruitment_strategy) %>% spread(treatment, n),
y = count(., recruitment_strategy, name = "Overall_dataset"),
by = "recruitment_strategy"
) %>%
mutate_at(
.vars = vars(-recruitment_strategy),
.funs = ~ str_glue("{.} ({scales::percent(. / sum(.), accuracy = 1)})")
)
#> # A tibble: 5 x 5
#> recruitment_strategy Group1 Group2 Group3 Overall_dataset
#> <chr> <glue> <glue> <glue> <glue>
#> 1 Strategy 1 2 (6%) 13 (30%) 4 (16%) 19 (19%)
#> 2 Strategy 2 8 (26%) 6 (14%) 6 (24%) 20 (20%)
#> 3 Strategy 3 6 (19%) 12 (27%) 3 (12%) 21 (21%)
#> 4 Strategy 4 9 (29%) 4 (9%) 5 (20%) 18 (18%)
#> 5 Strategy 5 6 (19%) 9 (20%) 7 (28%) 22 (22%)
Created on 2019-11-10 by the reprex package (v0.3.0)
Upvotes: 4