Reputation: 65
I am computing summary statistcs for many variables in a large data frame (it has 130 variables). I wish to compute summary statistics per decade and for all my data. I would like to avoid writing summarise twice: once for the grouping and once for the full data. I also use summarise_at, so my question also applies for summarise_at.
Here is a minimal example, where summarise_at() seems a bit over the top, but not so for my real data:
my.data <- data.frame(Date = as.Date(c('1981-04-09', '1983-02-01', '1992-10-19', '1996-11-22', '1987-05-15')),
decade = c('Eighties', 'Eighties', 'Nineties', 'Nineties', 'Eighties'),
price = c(10, 11, 17, 34, 12),
d.sector.Oil = c(0,0,1,1,1),
d.sector.Mines = c(1,1,0,0,0)) # An example dataframe
# Calculate summary statistics for each decade and each variable:
sumst.decades<- my.data %>% group_by(decade) %>% summarise(mean.price = mean(price))
sumd.decades<- my.data %>% group_by(decade) %>% summarise_at(vars(starts_with('d.sector.')), sum)
I need the same summaries for my full data. For which I would repeat above command lines, but without the "group_by":
sumst<- summarise(my.data, mean.price = mean(price))
sumd<- summarise_at(my.data, vars(starts_with('d.sector.')), sum)
I wish to not have to write the last two lines but somehow have dplyr give me a global summary in addition to the groupwise summaries. For example, if there were a function group_by* that creates groups as required and also creates a global group. Does something like this exist? I hope my question is clear. Thanks in advance.
Upvotes: 4
Views: 4196
Reputation: 5798
Tidyverse solution:
library(tidyverse)
my.data %>%
mutate(decade = as.character(decade)) %>%
bind_rows(., my.data %>% mutate(decade = as.character("All"))) %>%
group_by(decade) %>%
summarise_if(is.numeric, c("sum", "mean")) %>%
ungroup()
Upvotes: 7
Reputation: 6941
I am not aware of an existing dplyr function that does this. But I have written a work around function you can use for this purpose:
overall_group = function(data, col_name){
d1 = data %>%
mutate(summary_level = "grouped")
d2 = data %>%
mutate(summary_level = "ungrouped") %>%
mutate(!!sym(col_name) := NA)
d12 = rbind(d1, d2) %>%
group_by(summary_level, !!sym(col_name))
return(d12)
}
Demonstration:
> mtcars %>% summarise(avg_mpg = mean(mpg))
avg_mpg
1 20.09062
> mtcars %>% group_by(gear) %>% summarise(avg_mpg = mean(mpg))
# A tibble: 3 x 2
gear avg_mpg
<dbl> <dbl>
1 3 16.1
2 4 24.5
3 5 21.4
> mtcars %>% overall_group("gear") %>% summarise(avg_mpg = mean(mpg))
# A tibble: 4 x 3
# Groups: summary_level [2]
summary_level gear avg_mpg
<chr> <dbl> <dbl>
1 grouped 3 16.1
2 grouped 4 24.5
3 grouped 5 21.4
4 ungrouped NA 20.1
This approach is a hack on the group_by. It would be better to implement the overall_*
functionality on the summarise. However, I don't know enough about the inner workings of summarise to design such a function.
Alternative, having a row that gives a total or overall average is more common when presenting tables for reports. So you might be better off searching for this kind of summary function in a package that focuses on presenting tables, instead of using dplyr.
Upvotes: 5