Zach
Zach

Reputation: 1396

How to maintain original row names when binding rows from different data frames?

How can I preserve one of the grouping names when using dply::summarise? Or, is there a better way to preserve one of the group names? I am probably going about this rather inefficiently.

I have a data.frame (df) as such:

dput(head(df, n = 20))
structure(list(file_src = c("CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", 
"CBG_EFD.xlsx"), AU = c("CBD", "CBD", "CBD", "CBD", "CBD", "CBD", 
"CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", 
"CBD", "CBD", "CBD", "CBD", "CBD"), BU = c("OAO", "Constr", "Retail", 
"OAO", "Constr", "Retail", "OAO", "Constr", "Retail", "OAO", 
"Constr", "Retail", "OAO", "Constr", "Retail", "OAO", "Constr", 
"Retail", "OAO", "Constr"), CC = c("AUDIT", "AUDIT", "AUDIT", 
"AUDIT", "AUDIT", "AUDIT", "CORC", "CORC", "CORC", "CORC", "CORC", 
"CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC", 
"CORC"), CA_LVL = c("AUDIT01", "AUDIT01", "AUDIT01", "AUDIT02", 
"AUDIT02", "AUDIT02", "CORC01", "CORC01", "CORC01", "CORC02", 
"CORC02", "CORC02", "CORC03", "CORC03", "CORC03", "CORC04", "CORC04", 
"CORC04", "CORC05", "CORC05"), Score = c(1, 1, 2, 1, 3, 3, 1, 
3, 2, 2, 4, 2, 2, 3, 1, 4, 2, 3, 3, 2)), .Names = c("file_src", 
"AU", "BU", "CC", "CA_LVL", "Score"), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

Defintions Where AU is a set of five (5) 'groups' and BU is a set of fifty-five (55) units all of which belong to one of the five AUs. Parent-child relationship. Score is a raw number 0-4. Control_Category is a variable of which there are six (string values).

Currently, my code is broken out such that the script performs two levels of grouping and aggregating of scores to give a simple average. I group first at the AU level to get simple averages across all of the units of a given group for their respective Categories (CC). In the end, I have five data.frames (cbg.au.stat.wide, cbd.au.stat.wide, etc). These dfs represent the average score for a given category across all units of a given group.

# Group1 assessment unit scores
cbg.au.stat.wide <- df %>%
  group_by(AU, CC) %>%
  filter(AU == "CBG") %>%
  summarise(avg = mean(Score, na.rm = TRUE)) %>%
  dcast(AU ~ CC, value.var = "avg") %>%
  print()  # end chain

Which produces:

cbg.au.stat.wide
   AU AUDIT     CORC GOV      PPS     TMSC    TRAIN
1 CBG     3 2.733333 2.2 2.666667 1.583333 2.666667

Later, all of the 'AU level' data frames are combined using dplyr::bind_rows

au.avg.scores <- bind_rows(
  bsa.au.stat.wide,bsg.au.stat.wide,cbd.au.stat.wide,
  cbg.au.stat.wide,wmg.au.stat.wide)

au.avg.scores
         AU    AUDIT     CORC      GOV      PPS     TMSC    TRAIN
1 BSA Admin 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
2       BSG 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
3       CBD 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
4       CBG 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
5       WMG 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000

Then I perform a similar grouping and summarizing activity. Only this time instead of being at the AU level (parent) I do it at the BU level for each Category (CC). So, then for a given AU I know have a table of BU average scores across their Control Categories.

# Group1 business units by Control Category
cbg.bu.stat.wide <- df %>%
  group_by(BU, CC) %>%
  filter(AU == "CBG") %>%
  summarise(avg = mean(Score, na.rm = TRUE)) %>%
  dcast(BU ~ CC, value.var = "avg") %>%
  print() # end chain

Which produces:

                      BU AUDIT CORC GOV PPS TMSC TRAIN
1        Capital Markets     3  3.2 1.6   4 1.00     3
2                    EFD     4  2.6 1.6   3 1.75     3
3 Global Trade Solutions     3  2.4 3.4   1 2.00     2
4         Investigations     1   NA  NA  NA   NA    NA

What I think you'll notice here at the 'BU' level is that the 'AU' level has been dropped. Eventually, I'd like to combine all of these BUs into a big table that shows the BU and the AU from whence it originated

So that it would end up looking like this:

> bu.avg.scores
AU BU AUDIT CORC GOV PPS TMSC TRAIN
CBG Adherence   3.0  1.4 3.2   1 1.50   3.0
CBG CTR   2.0  2.8 2.0   4 1.50   2.5
CBG HRCU   3.5  1.8 3.0   1 2.25   1.5
CBD Investigations   2.0   NA  NA  NA   NA    NA
BSG ACH   2.0  0.0 2.0   4 1.50   2.5

Upvotes: 2

Views: 1348

Answers (1)

eipi10
eipi10

Reputation: 93761

Updated Answer

Here's an updated answer, based on the comment thread. We summarise by AU and BU separately and store the results in a list. We then show how to combine the summaries into a single data frame and to output the summary as a table.

library(tidyverse)

# Summarize by AU and (separately) by BU and store each summary in a list
dfs = list(AU = df %>% 
             group_by(AU, CC) %>% 
             summarise(avg=mean(Score, na.rm=TRUE)),
           BU = df %>% 
             group_by(BU, CC) %>% 
             summarise(avg=mean(Score, na.rm=TRUE)))

Each summary is now stored in a separate list element. This keeps the two different levels of summary separate, but stored in one object so it's easy to process further.

dfs
$AU
     AU    CC      avg
1   CBD AUDIT 1.833333
2   CBD  CORC 2.428571

$BU
      BU    CC   avg
1 Constr AUDIT   2.0
2 Constr  CORC   2.8
3    OAO AUDIT   1.0
4    OAO  CORC   2.4
5 Retail AUDIT   2.5
6 Retail  CORC   2.0

If you want a single data frame, you could do this:

# Combine into a single table and spread
df.table = bind_rows(dfs, .id="Unit Level") %>% 
  replace(., is.na(.), "") %>%  # To avoid "NA" values when we "unite" below
  unite(Unit, AU, BU, sep="") %>% 
  spread(CC, avg)

df.table
  `Unit Level`   Unit    AUDIT     CORC
1           AU    CBD 1.833333 2.428571
2           BU Constr 2.000000 2.800000
3           BU    OAO 1.000000 2.400000
4           BU Retail 2.500000 2.000000

If you're creating a report in rmarkdown, you can turn this into an output table. Here's an example where we remove repeated row identifiers:

```{r}
knitr::kable(df.table %>% 
               mutate(`Unit Level` = replace(`Unit Level`, duplicated(`Unit Level`), "")))
```

This is what the table looks like when output in a PDF file:

enter image description here

Or, if you want to add a midrule line to separate the AU and BU averages, you could do this:

```{r, results="asis"}
library(xtable)
options(xtable.include.rownames=FALSE, xtable.comment=FALSE)

print(xtable(df.table %>% 
               mutate(`Unit Level` = replace(`Unit Level`, duplicated(`Unit Level`), ""))),
             hline.after=c(-1,0,cumsum(table(df.table["Unit Level"]))))
```

enter image description here

Original Answer

In the code below we first calculate average at the AU and BU level. Then we calculate the average at the AU level and use bind_rows to combine the two levels of averages. Then we can spread the resulting data frame to wide format.

library(tidyverse)

# Get averages at the AU-BU level
dfs = df %>%
  group_by(AU, BU, CC) %>%
  summarise(avg = mean(Score, na.rm = TRUE)) 

dfs
     AU     BU    CC     n   avg
1   CBD Constr AUDIT     2   2.0
2   CBD Constr  CORC     5   2.8
3   CBD    OAO AUDIT     2   1.0
4   CBD    OAO  CORC     5   2.4
5   CBD Retail AUDIT     2   2.5
6   CBD Retail  CORC     4   2.0
# Combine with averages at the AU level
dfs = bind_rows(dfs, 
                df %>%
                  group_by(AU, CC) %>%
                  summarise(avg = mean(Score, na.rm = TRUE)) %>% 
                  mutate(BU = paste("All", AU,"BU")))

dfs   
     AU         BU    CC      avg
1   CBD     Constr AUDIT 2.000000
2   CBD     Constr  CORC 2.800000
3   CBD        OAO AUDIT 1.000000
4   CBD        OAO  CORC 2.400000
5   CBD     Retail AUDIT 2.500000
6   CBD     Retail  CORC 2.000000
7   CBD All CBD BU AUDIT 1.833333
8   CBD All CBD BU  CORC 2.428571
# Spread (does same thing as dcast, but using tidyr spread function)
dfs %>% spread(CC, avg)
     AU         BU    AUDIT     CORC
1   CBD All CBD BU 1.833333 2.428571
2   CBD     Constr 2.000000 2.800000
3   CBD        OAO 1.000000 2.400000
4   CBD     Retail 2.500000 2.000000

This can be combined into a single chain:

dfs = df %>%
  group_by(AU, BU, CC) %>%
  summarise(avg = mean(Score, na.rm = TRUE)) %>% 
  bind_rows(
    df %>%
      group_by(AU, CC) %>%
      summarise(avg = mean(Score, na.rm = TRUE)) %>% 
      mutate(BU = paste("All", AU,"BU"))
  ) %>% 
  spread(CC, avg)

Upvotes: 1

Related Questions