Reputation: 103
I am using the dplyr package to generate some tables and I'm making use of the adorn_totals("row")
function.
This works fine when I want to sum values within the groups, however in some cases I want an overall mean instead of a sum. Is there an adorn_means function?
Sample code:
Regions2 <- Data %>%
filter(!is.na(REGION))%>%
group_by(REGION) %>%
summarise(Numberofpeople=length(Names))%>%
adorn_totals("row")
here my "total" row is simply the sum of all people within the regions. This gives me
REGION NumberofPeople
East Midlands 578,943
East of England 682,917
London 1,247,540
North East 245,830
North West 742,886
South East 963,040
South West 623,684
West Midlands 653,335
Yorkshire 553,853
TOTAL 6,292,028
My next piece of code generates an average salary for each region, but I want to add an overall average for the total
Regions3 <- Data %>%
filter(!is.na(REGION))%>%
filter(!is.na(AVGSalary))%>%
group_by(REGION) %>%
summarise(AverageSalary=mean(AVGSalary))
if I use adnorn_totals("row")
as before I simply get the sum of the averages, not the overall average for the dataset.
How do I get the overall average?
UPADATE with some noddy data:
Data
people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150
The group averages are therefore:
London 1000
South West 983.33
South East 900
Scotland 1020
East 600
I want to add the overall total to the bottom
Total 941.11
Upvotes: 4
Views: 2131
Reputation: 269441
1) Because the overall average is the weighted average of the averages (not the plain average of the averages), i.e. it is 941 and not 901, we maintain an n
column so that in the end we can correctly compute the overall average. Although the data shown does not have any NAs we use drop_na
in order to also use it with such data. This will remove any row containing an NA.
library(dplyr)
library(tidyr)
Region %>%
drop_na %>%
group_by(region) %>%
summarize(avg = mean(salary), n = n()) %>%
ungroup %>%
bind_rows(summarize(., region = "Overall Avg",
avg = sum(avg * n) / sum(n),
n = sum(n))) %>%
select(-n)
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
2) Another approach would be to construct the Overall Avg line by going back to the original data:
Region %>%
drop_na %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region %>% drop_na, region = "Overall Avg", avg = mean(salary)))
giving:
# A tibble: 6 x 2
region avg
<chr> <dbl>
1 East 600
2 London 1000
3 Scotland 1020
4 South East 900
5 South West 983.
6 Overall Avg 941.
2a) If you object to referring to Region
twice then try this.
Region_ <- Region %>%
drop_na
Region_ %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
2b) or as a single pipeline where now Region_
is local to the pipeline and will automatically be removed after the pipeline completes:
Region %>%
drop_na %>%
{ Region_ <- .
Region_ %>%
group_by(region) %>%
summarize(avg = mean(salary)) %>%
ungroup %>%
bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
}
We used this as the input:
Lines <- "people region salary
person1 London 1000
person2 South West 1050
person3 South East 900
person4 London 800
person5 Scotland 1020
person6 South West 750
person7 East 600
person8 London 1200
person9 South West 1150"
library(gsubfn)
Region <- read.pattern(text = Lines, pattern = "^(\\S+) +(.*) (\\d+)$",
as.is = TRUE, skip = 1, strip.white = TRUE,
col.names = read.table(text = Lines, nrow = 1, as.is = TRUE))
Upvotes: 6
Reputation: 886938
One option is to add a row with bind_rows
library(dplyr)
Data %>%
group_by(region) %>%
summarise(Avgsalary = mean(salary)) %>%
bind_rows(data_frame(region = 'Total',
Avgsalary = mean(.$Avgsalary, na.rm = TRUE)))
Or another option is add_row
from tibble
Data %>%
group_by(region) %>%
summarise(Avgsalary = mean(salary)) %>%
add_row(region = 'Total', Avgsalary = mean(.$Avgsalary))
If this is based on the overall mean before taking the mean
, then we need to calculate it before
Data %>%
mutate(Total = mean(salary)) %>%
group_by(region) %>%
summarise(Avgsummary = mean(salary), Total = first(Total)) %>%
add_row(region = 'Total', Avgsummary = .$Total[1]) %>%
select(-Total)
Upvotes: 3