debrah
debrah

Reputation: 65

Dplyr: summarise simultaneously for groups and entire data

I am computing summary statistcs for many variables in a large data frame (it has 130 variables). I wish to compute summary statistics per decade and for all my data. I would like to avoid writing summarise twice: once for the grouping and once for the full data. I also use summarise_at, so my question also applies for summarise_at.

Here is a minimal example, where summarise_at() seems a bit over the top, but not so for my real data:

my.data <- data.frame(Date = as.Date(c('1981-04-09', '1983-02-01', '1992-10-19', '1996-11-22', '1987-05-15')),
  decade = c('Eighties', 'Eighties', 'Nineties', 'Nineties', 'Eighties'),
  price = c(10, 11, 17, 34, 12),
  d.sector.Oil = c(0,0,1,1,1),
  d.sector.Mines = c(1,1,0,0,0)) # An example dataframe
# Calculate summary statistics for each decade and each variable:
sumst.decades<- my.data %>% group_by(decade) %>% summarise(mean.price = mean(price))
sumd.decades<- my.data %>% group_by(decade) %>% summarise_at(vars(starts_with('d.sector.')), sum)

I need the same summaries for my full data. For which I would repeat above command lines, but without the "group_by":

sumst<- summarise(my.data, mean.price = mean(price))
sumd<- summarise_at(my.data, vars(starts_with('d.sector.')), sum)

I wish to not have to write the last two lines but somehow have dplyr give me a global summary in addition to the groupwise summaries. For example, if there were a function group_by* that creates groups as required and also creates a global group. Does something like this exist? I hope my question is clear. Thanks in advance.

Upvotes: 4

Views: 4196

Answers (2)

hello_friend
hello_friend

Reputation: 5798

Tidyverse solution:

library(tidyverse)

my.data %>% 
  mutate(decade = as.character(decade)) %>%
  bind_rows(., my.data %>% mutate(decade = as.character("All"))) %>%
  group_by(decade) %>% 
  summarise_if(is.numeric, c("sum", "mean")) %>% 
  ungroup()

Upvotes: 7

Simon.S.A.
Simon.S.A.

Reputation: 6941

I am not aware of an existing dplyr function that does this. But I have written a work around function you can use for this purpose:

overall_group = function(data, col_name){

  d1 = data %>%
    mutate(summary_level = "grouped")

  d2 = data %>%
    mutate(summary_level = "ungrouped") %>%
    mutate(!!sym(col_name) := NA)

  d12 = rbind(d1, d2) %>%
    group_by(summary_level, !!sym(col_name))

  return(d12)
}

Demonstration:

> mtcars %>% summarise(avg_mpg = mean(mpg))
   avg_mpg
1 20.09062
> mtcars %>% group_by(gear) %>% summarise(avg_mpg = mean(mpg))
# A tibble: 3 x 2
   gear avg_mpg
  <dbl>   <dbl>
1     3    16.1
2     4    24.5
3     5    21.4
> mtcars %>% overall_group("gear") %>% summarise(avg_mpg = mean(mpg))
# A tibble: 4 x 3
# Groups:   summary_level [2]
  summary_level  gear avg_mpg
  <chr>         <dbl>   <dbl>
1 grouped           3    16.1
2 grouped           4    24.5
3 grouped           5    21.4
4 ungrouped        NA    20.1

This approach is a hack on the group_by. It would be better to implement the overall_* functionality on the summarise. However, I don't know enough about the inner workings of summarise to design such a function.

Alternative, having a row that gives a total or overall average is more common when presenting tables for reports. So you might be better off searching for this kind of summary function in a package that focuses on presenting tables, instead of using dplyr.

Upvotes: 5

Related Questions