How to maintain original row names when binding rows from different data frames?

Question

How can I preserve one of the grouping names when using dply::summarise? Or, is there a better way to preserve one of the group names? I am probably going about this rather inefficiently.

I have a data.frame (df) as such:

dput(head(df, n = 20))
structure(list(file_src = c("CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx"), AU = c("CBD", "CBD", "CBD", "CBD", "CBD", "CBD", 
"CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", 
"CBD", "CBD", "CBD", "CBD", "CBD"), BU = c("OAO", "Constr", "Retail", 
"OAO", "Constr", "Retail", "OAO", "Constr", "Retail", "OAO", 
"Constr", "Retail", "OAO", "Constr", "Retail", "OAO", "Constr", 
"Retail", "OAO", "Constr"), CC = c("AUDIT", "AUDIT", "AUDIT", 
"AUDIT", "AUDIT", "AUDIT", "CORC", "CORC", "CORC", "CORC", "CORC", 
"CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC", 
"CORC"), CA_LVL = c("AUDIT01", "AUDIT01", "AUDIT01", "AUDIT02", 
"AUDIT02", "AUDIT02", "CORC01", "CORC01", "CORC01", "CORC02", 
"CORC02", "CORC02", "CORC03", "CORC03", "CORC03", "CORC04", "CORC04", 
"CORC04", "CORC05", "CORC05"), Score = c(1, 1, 2, 1, 3, 3, 1, 
3, 2, 2, 4, 2, 2, 3, 1, 4, 2, 3, 3, 2)), .Names = c("file_src", 
"AU", "BU", "CC", "CA_LVL", "Score"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Defintions Where AU is a set of five (5) 'groups' and BU is a set of fifty-five (55) units all of which belong to one of the five AUs. Parent-child relationship. Score is a raw number 0-4. Control_Category is a variable of which there are six (string values).

Currently, my code is broken out such that the script performs two levels of grouping and aggregating of scores to give a simple average. I group first at the AU level to get simple averages across all of the units of a given group for their respective Categories (CC). In the end, I have five data.frames (cbg.au.stat.wide, cbd.au.stat.wide, etc). These dfs represent the average score for a given category across all units of a given group.

# Group1 assessment unit scores
cbg.au.stat.wide <- df %>%
  group_by(AU, CC) %>%
  filter(AU == "CBG") %>%
  summarise(avg = mean(Score, na.rm = TRUE)) %>%
  dcast(AU ~ CC, value.var = "avg") %>%
  print()  # end chain

Which produces:

cbg.au.stat.wide
   AU AUDIT     CORC GOV      PPS     TMSC    TRAIN
1 CBG     3 2.733333 2.2 2.666667 1.583333 2.666667

Later, all of the 'AU level' data frames are combined using dplyr::bind_rows

au.avg.scores <- bind_rows(
  bsa.au.stat.wide,bsg.au.stat.wide,cbd.au.stat.wide,
  cbg.au.stat.wide,wmg.au.stat.wide)

au.avg.scores
         AU    AUDIT     CORC      GOV      PPS     TMSC    TRAIN
1 BSA Admin 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
2       BSG 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
3       CBD 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
4       CBG 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
5       WMG 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000

Then I perform a similar grouping and summarizing activity. Only this time instead of being at the AU level (parent) I do it at the BU level for each Category (CC). So, then for a given AU I know have a table of BU average scores across their Control Categories.

# Group1 business units by Control Category
cbg.bu.stat.wide <- df %>%
  group_by(BU, CC) %>%
  filter(AU == "CBG") %>%
  summarise(avg = mean(Score, na.rm = TRUE)) %>%
  dcast(BU ~ CC, value.var = "avg") %>%
  print() # end chain

Which produces:

                      BU AUDIT CORC GOV PPS TMSC TRAIN
1        Capital Markets     3  3.2 1.6   4 1.00     3
2                    EFD     4  2.6 1.6   3 1.75     3
3 Global Trade Solutions     3  2.4 3.4   1 2.00     2
4         Investigations     1   NA  NA  NA   NA    NA

What I think you'll notice here at the 'BU' level is that the 'AU' level has been dropped. Eventually, I'd like to combine all of these BUs into a big table that shows the BU and the AU from whence it originated

So that it would end up looking like this:

> bu.avg.scores
AU BU AUDIT CORC GOV PPS TMSC TRAIN
CBG Adherence   3.0  1.4 3.2   1 1.50   3.0
CBG CTR   2.0  2.8 2.0   4 1.50   2.5
CBG HRCU   3.5  1.8 3.0   1 2.25   1.5
CBD Investigations   2.0   NA  NA  NA   NA    NA
BSG ACH   2.0  0.0 2.0   4 1.50   2.5

eipi10 · Accepted Answer

Updated Answer

Here's an updated answer, based on the comment thread. We summarise by AU and BU separately and store the results in a list. We then show how to combine the summaries into a single data frame and to output the summary as a table.

library(tidyverse)

# Summarize by AU and (separately) by BU and store each summary in a list
dfs = list(AU = df %>% 
             group_by(AU, CC) %>% 
             summarise(avg=mean(Score, na.rm=TRUE)),
           BU = df %>% 
             group_by(BU, CC) %>% 
             summarise(avg=mean(Score, na.rm=TRUE)))

Each summary is now stored in a separate list element. This keeps the two different levels of summary separate, but stored in one object so it's easy to process further.

dfs

$AU
     AU    CC      avg
1   CBD AUDIT 1.833333
2   CBD  CORC 2.428571

$BU
      BU    CC   avg
1 Constr AUDIT   2.0
2 Constr  CORC   2.8
3    OAO AUDIT   1.0
4    OAO  CORC   2.4
5 Retail AUDIT   2.5
6 Retail  CORC   2.0

If you want a single data frame, you could do this:

# Combine into a single table and spread
df.table = bind_rows(dfs, .id="Unit Level") %>% 
  replace(., is.na(.), "") %>%  # To avoid "NA" values when we "unite" below
  unite(Unit, AU, BU, sep="") %>% 
  spread(CC, avg)

df.table

  `Unit Level`   Unit    AUDIT     CORC
1           AU    CBD 1.833333 2.428571
2           BU Constr 2.000000 2.800000
3           BU    OAO 1.000000 2.400000
4           BU Retail 2.500000 2.000000

If you're creating a report in rmarkdown, you can turn this into an output table. Here's an example where we remove repeated row identifiers:

```{r}
knitr::kable(df.table %>% 
               mutate(`Unit Level` = replace(`Unit Level`, duplicated(`Unit Level`), "")))
```

This is what the table looks like when output in a PDF file:

Or, if you want to add a midrule line to separate the AU and BU averages, you could do this:

```{r, results="asis"}
library(xtable)
options(xtable.include.rownames=FALSE, xtable.comment=FALSE)

print(xtable(df.table %>% 
               mutate(`Unit Level` = replace(`Unit Level`, duplicated(`Unit Level`), ""))),
             hline.after=c(-1,0,cumsum(table(df.table["Unit Level"]))))
```

Original Answer

In the code below we first calculate average at the AU and BU level. Then we calculate the average at the AU level and use bind_rows to combine the two levels of averages. Then we can spread the resulting data frame to wide format.

library(tidyverse)

# Get averages at the AU-BU level
dfs = df %>%
  group_by(AU, BU, CC) %>%
  summarise(avg = mean(Score, na.rm = TRUE)) 

dfs

     AU     BU    CC     n   avg
1   CBD Constr AUDIT     2   2.0
2   CBD Constr  CORC     5   2.8
3   CBD    OAO AUDIT     2   1.0
4   CBD    OAO  CORC     5   2.4
5   CBD Retail AUDIT     2   2.5
6   CBD Retail  CORC     4   2.0

# Combine with averages at the AU level
dfs = bind_rows(dfs, 
                df %>%
                  group_by(AU, CC) %>%
                  summarise(avg = mean(Score, na.rm = TRUE)) %>% 
                  mutate(BU = paste("All", AU,"BU")))

dfs

     AU         BU    CC      avg
1   CBD     Constr AUDIT 2.000000
2   CBD     Constr  CORC 2.800000
3   CBD        OAO AUDIT 1.000000
4   CBD        OAO  CORC 2.400000
5   CBD     Retail AUDIT 2.500000
6   CBD     Retail  CORC 2.000000
7   CBD All CBD BU AUDIT 1.833333
8   CBD All CBD BU  CORC 2.428571

# Spread (does same thing as dcast, but using tidyr spread function)
dfs %>% spread(CC, avg)

     AU         BU    AUDIT     CORC
1   CBD All CBD BU 1.833333 2.428571
2   CBD     Constr 2.000000 2.800000
3   CBD        OAO 1.000000 2.400000
4   CBD     Retail 2.500000 2.000000

This can be combined into a single chain:

dfs = df %>%
  group_by(AU, BU, CC) %>%
  summarise(avg = mean(Score, na.rm = TRUE)) %>% 
  bind_rows(
    df %>%
      group_by(AU, CC) %>%
      summarise(avg = mean(Score, na.rm = TRUE)) %>% 
      mutate(BU = paste("All", AU,"BU"))
  ) %>% 
  spread(CC, avg)

How to maintain original row names when binding rows from different data frames?

Answers (1)

Updated Answer

Original Answer

Related Questions