Nottles82
Nottles82

Reputation: 103

Adding overall mean when using group_by

I am using the dplyr package to generate some tables and I'm making use of the adorn_totals("row") function.

This works fine when I want to sum values within the groups, however in some cases I want an overall mean instead of a sum. Is there an adorn_means function?

Sample code:

Regions2 <- Data %>%
  filter(!is.na(REGION))%>%
  group_by(REGION) %>%
  summarise(Numberofpeople=length(Names))%>%
  adorn_totals("row")

here my "total" row is simply the sum of all people within the regions. This gives me

REGION          NumberofPeople
East Midlands       578,943
East of England     682,917
London            1,247,540
North East          245,830
North West          742,886
South East          963,040
South West          623,684
West Midlands       653,335
Yorkshire           553,853
TOTAL             6,292,028

My next piece of code generates an average salary for each region, but I want to add an overall average for the total

Regions3 <- Data %>%
  filter(!is.na(REGION))%>%
  filter(!is.na(AVGSalary))%>%
  group_by(REGION) %>%
  summarise(AverageSalary=mean(AVGSalary))

if I use adnorn_totals("row") as before I simply get the sum of the averages, not the overall average for the dataset.

How do I get the overall average?

UPADATE with some noddy data:

Data

people  region      salary
person1 London      1000
person2 South West  1050
person3 South East  900
person4 London      800
person5 Scotland    1020
person6 South West  750
person7 East        600
person8 London      1200
person9 South West  1150

The group averages are therefore:

London      1000
South West  983.33
South East  900
Scotland    1020
East        600

I want to add the overall total to the bottom

Total    941.11

Upvotes: 4

Views: 2131

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269441

1) Because the overall average is the weighted average of the averages (not the plain average of the averages), i.e. it is 941 and not 901, we maintain an n column so that in the end we can correctly compute the overall average. Although the data shown does not have any NAs we use drop_na in order to also use it with such data. This will remove any row containing an NA.

library(dplyr)
library(tidyr)

Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary), n = n()) %>%
  ungroup %>%
  bind_rows(summarize(., region = "Overall Avg", 
                         avg = sum(avg * n) / sum(n), 
                         n = sum(n))) %>%
  select(-n)

giving:

# A tibble: 6 x 2
  region        avg
  <chr>       <dbl>
1 East         600 
2 London      1000 
3 Scotland    1020 
4 South East   900 
5 South West   983.
6 Overall Avg  941.

2) Another approach would be to construct the Overall Avg line by going back to the original data:

Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region %>% drop_na, region = "Overall Avg", avg = mean(salary)))

giving:

# A tibble: 6 x 2
  region        avg
  <chr>       <dbl>
1 East         600 
2 London      1000 
3 Scotland    1020 
4 South East   900 
5 South West   983.
6 Overall Avg  941.

2a) If you object to referring to Region twice then try this.

Region_ <- Region %>% 
  drop_na

Region_ %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))

2b) or as a single pipeline where now Region_ is local to the pipeline and will automatically be removed after the pipeline completes:

Region %>%
  drop_na %>%
  { Region_ <- .
    Region_ %>%
      group_by(region) %>%
      summarize(avg = mean(salary)) %>%
      ungroup %>%
      bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
  }

Note

We used this as the input:

Lines <- "people  region      salary
person1 London      1000
person2 South West  1050
person3 South East  900
person4 London      800
person5 Scotland    1020
person6 South West  750
person7 East        600
person8 London      1200
person9 South West  1150"

library(gsubfn)
Region <- read.pattern(text = Lines, pattern = "^(\\S+) +(.*) (\\d+)$", 
  as.is = TRUE, skip = 1, strip.white = TRUE,
  col.names = read.table(text = Lines, nrow = 1, as.is = TRUE))

Upvotes: 6

akrun
akrun

Reputation: 886938

One option is to add a row with bind_rows

library(dplyr)
Data %>% 
   group_by(region) %>% 
   summarise(Avgsalary = mean(salary)) %>%
   bind_rows(data_frame(region = 'Total',
                        Avgsalary = mean(.$Avgsalary, na.rm = TRUE)))

Or another option is add_row from tibble

Data %>% 
   group_by(region) %>% 
   summarise(Avgsalary = mean(salary)) %>% 
   add_row(region = 'Total', Avgsalary = mean(.$Avgsalary))

If this is based on the overall mean before taking the mean, then we need to calculate it before

Data %>%  
  mutate(Total = mean(salary)) %>% 
  group_by(region) %>%
  summarise(Avgsummary = mean(salary), Total = first(Total)) %>% 
  add_row(region = 'Total', Avgsummary = .$Total[1]) %>% 
  select(-Total)

Upvotes: 3

Related Questions