Reputation: 1396
How can I preserve one of the grouping names when using dply::summarise? Or, is there a better way to preserve one of the group names? I am probably going about this rather inefficiently.
I have a data.frame (df) as such:
dput(head(df, n = 20))
structure(list(file_src = c("CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx"), AU = c("CBD", "CBD", "CBD", "CBD", "CBD", "CBD",
"CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD",
"CBD", "CBD", "CBD", "CBD", "CBD"), BU = c("OAO", "Constr", "Retail",
"OAO", "Constr", "Retail", "OAO", "Constr", "Retail", "OAO",
"Constr", "Retail", "OAO", "Constr", "Retail", "OAO", "Constr",
"Retail", "OAO", "Constr"), CC = c("AUDIT", "AUDIT", "AUDIT",
"AUDIT", "AUDIT", "AUDIT", "CORC", "CORC", "CORC", "CORC", "CORC",
"CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC",
"CORC"), CA_LVL = c("AUDIT01", "AUDIT01", "AUDIT01", "AUDIT02",
"AUDIT02", "AUDIT02", "CORC01", "CORC01", "CORC01", "CORC02",
"CORC02", "CORC02", "CORC03", "CORC03", "CORC03", "CORC04", "CORC04",
"CORC04", "CORC05", "CORC05"), Score = c(1, 1, 2, 1, 3, 3, 1,
3, 2, 2, 4, 2, 2, 3, 1, 4, 2, 3, 3, 2)), .Names = c("file_src",
"AU", "BU", "CC", "CA_LVL", "Score"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Defintions Where AU is a set of five (5) 'groups' and BU is a set of fifty-five (55) units all of which belong to one of the five AUs. Parent-child relationship. Score is a raw number 0-4. Control_Category is a variable of which there are six (string values).
Currently, my code is broken out such that the script performs two levels of grouping and aggregating of scores to give a simple average. I group first at the AU level to get simple averages across all of the units of a given group for their respective Categories (CC). In the end, I have five data.frames (cbg.au.stat.wide, cbd.au.stat.wide, etc). These dfs represent the average score for a given category across all units of a given group.
# Group1 assessment unit scores
cbg.au.stat.wide <- df %>%
group_by(AU, CC) %>%
filter(AU == "CBG") %>%
summarise(avg = mean(Score, na.rm = TRUE)) %>%
dcast(AU ~ CC, value.var = "avg") %>%
print() # end chain
Which produces:
cbg.au.stat.wide
AU AUDIT CORC GOV PPS TMSC TRAIN
1 CBG 3 2.733333 2.2 2.666667 1.583333 2.666667
Later, all of the 'AU level' data frames are combined using dplyr::bind_rows
au.avg.scores <- bind_rows(
bsa.au.stat.wide,bsg.au.stat.wide,cbd.au.stat.wide,
cbg.au.stat.wide,wmg.au.stat.wide)
au.avg.scores
AU AUDIT CORC GOV PPS TMSC TRAIN
1 BSA Admin 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
2 BSG 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
3 CBD 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
4 CBG 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
5 WMG 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000
Then I perform a similar grouping and summarizing activity. Only this time instead of being at the AU level (parent) I do it at the BU level for each Category (CC). So, then for a given AU I know have a table of BU average scores across their Control Categories.
# Group1 business units by Control Category
cbg.bu.stat.wide <- df %>%
group_by(BU, CC) %>%
filter(AU == "CBG") %>%
summarise(avg = mean(Score, na.rm = TRUE)) %>%
dcast(BU ~ CC, value.var = "avg") %>%
print() # end chain
Which produces:
BU AUDIT CORC GOV PPS TMSC TRAIN
1 Capital Markets 3 3.2 1.6 4 1.00 3
2 EFD 4 2.6 1.6 3 1.75 3
3 Global Trade Solutions 3 2.4 3.4 1 2.00 2
4 Investigations 1 NA NA NA NA NA
What I think you'll notice here at the 'BU' level is that the 'AU' level has been dropped. Eventually, I'd like to combine all of these BUs into a big table that shows the BU and the AU from whence it originated
So that it would end up looking like this:
> bu.avg.scores
AU BU AUDIT CORC GOV PPS TMSC TRAIN
CBG Adherence 3.0 1.4 3.2 1 1.50 3.0
CBG CTR 2.0 2.8 2.0 4 1.50 2.5
CBG HRCU 3.5 1.8 3.0 1 2.25 1.5
CBD Investigations 2.0 NA NA NA NA NA
BSG ACH 2.0 0.0 2.0 4 1.50 2.5
Upvotes: 2
Views: 1348
Reputation: 93761
Here's an updated answer, based on the comment thread. We summarise by AU
and BU
separately and store the results in a list. We then show how to combine the summaries into a single data frame and to output the summary as a table.
library(tidyverse)
# Summarize by AU and (separately) by BU and store each summary in a list
dfs = list(AU = df %>%
group_by(AU, CC) %>%
summarise(avg=mean(Score, na.rm=TRUE)),
BU = df %>%
group_by(BU, CC) %>%
summarise(avg=mean(Score, na.rm=TRUE)))
Each summary is now stored in a separate list element. This keeps the two different levels of summary separate, but stored in one object so it's easy to process further.
dfs
$AU AU CC avg 1 CBD AUDIT 1.833333 2 CBD CORC 2.428571 $BU BU CC avg 1 Constr AUDIT 2.0 2 Constr CORC 2.8 3 OAO AUDIT 1.0 4 OAO CORC 2.4 5 Retail AUDIT 2.5 6 Retail CORC 2.0
If you want a single data frame, you could do this:
# Combine into a single table and spread
df.table = bind_rows(dfs, .id="Unit Level") %>%
replace(., is.na(.), "") %>% # To avoid "NA" values when we "unite" below
unite(Unit, AU, BU, sep="") %>%
spread(CC, avg)
df.table
`Unit Level` Unit AUDIT CORC 1 AU CBD 1.833333 2.428571 2 BU Constr 2.000000 2.800000 3 BU OAO 1.000000 2.400000 4 BU Retail 2.500000 2.000000
If you're creating a report in rmarkdown
, you can turn this into an output table. Here's an example where we remove repeated row identifiers:
```{r}
knitr::kable(df.table %>%
mutate(`Unit Level` = replace(`Unit Level`, duplicated(`Unit Level`), "")))
```
This is what the table looks like when output in a PDF file:
Or, if you want to add a midrule line to separate the AU
and BU
averages, you could do this:
```{r, results="asis"}
library(xtable)
options(xtable.include.rownames=FALSE, xtable.comment=FALSE)
print(xtable(df.table %>%
mutate(`Unit Level` = replace(`Unit Level`, duplicated(`Unit Level`), ""))),
hline.after=c(-1,0,cumsum(table(df.table["Unit Level"]))))
```
In the code below we first calculate average at the AU
and BU
level. Then we calculate the average at the AU
level and use bind_rows
to combine the two levels of averages. Then we can spread
the resulting data frame to wide format.
library(tidyverse)
# Get averages at the AU-BU level
dfs = df %>%
group_by(AU, BU, CC) %>%
summarise(avg = mean(Score, na.rm = TRUE))
dfs
AU BU CC n avg 1 CBD Constr AUDIT 2 2.0 2 CBD Constr CORC 5 2.8 3 CBD OAO AUDIT 2 1.0 4 CBD OAO CORC 5 2.4 5 CBD Retail AUDIT 2 2.5 6 CBD Retail CORC 4 2.0
# Combine with averages at the AU level
dfs = bind_rows(dfs,
df %>%
group_by(AU, CC) %>%
summarise(avg = mean(Score, na.rm = TRUE)) %>%
mutate(BU = paste("All", AU,"BU")))
dfs
AU BU CC avg 1 CBD Constr AUDIT 2.000000 2 CBD Constr CORC 2.800000 3 CBD OAO AUDIT 1.000000 4 CBD OAO CORC 2.400000 5 CBD Retail AUDIT 2.500000 6 CBD Retail CORC 2.000000 7 CBD All CBD BU AUDIT 1.833333 8 CBD All CBD BU CORC 2.428571
# Spread (does same thing as dcast, but using tidyr spread function)
dfs %>% spread(CC, avg)
AU BU AUDIT CORC 1 CBD All CBD BU 1.833333 2.428571 2 CBD Constr 2.000000 2.800000 3 CBD OAO 1.000000 2.400000 4 CBD Retail 2.500000 2.000000
This can be combined into a single chain:
dfs = df %>%
group_by(AU, BU, CC) %>%
summarise(avg = mean(Score, na.rm = TRUE)) %>%
bind_rows(
df %>%
group_by(AU, CC) %>%
summarise(avg = mean(Score, na.rm = TRUE)) %>%
mutate(BU = paste("All", AU,"BU"))
) %>%
spread(CC, avg)
Upvotes: 1