Adding Total column in R dataframe

Question

I have following Dataframe in R.

Type   Category     Mar-20       Apr-20       Feb-20
A      RTSF         25 (25.00%)  30 (30.00%)  50 (50.00%)
A      PRRT         30 (30.00%)  30 (30.00%)  20 (20.00%)
A      ATTR         15 (15.00%)  30 (30.00%)  15 (15.00%)
A      STRF         30 (30.00%)  10 (10.00%)  15 (15.00%)
B      RTSF         20 (20.00%)  25 (38.46%)  50 (50.00%)
B      PRRT         40 (40.00%)  5 (7.69%)    0 (0.00%)
B      ATTR         10 (10.00%)  25 (38.46%)  0 (0.00%)
B      STRF         30 (30.00%)  10 (15.38%)  50 (50.00%)

I want to add the Total column for every Type and arrange the Months in the ascending order from left to right.

Dput(df)<-

structure(list(Type = c("A", "A", "A", "A", 
"B", "B", "B", "B"), category = c("RTSF", 
"PRRT", "ATTR", "STRF", "RTSF", 
"PRRT", "ATTR", "STRF"), `Apr-21` = c("1544 (47.7%)", 
"330 (10.2%)", "771 (23.8%)", "589 (18.2%)", "60 (2.38%)", "2448 (97.30%)", 
"0 (0.00%)", "8 (0.32%)"), `Dec-20` = c("2547 (50.93%)", "15 (0.30%)", 
"2435 (48.69%)", "4 (0.08%)", "680 (97.8%)", "0 (0.0%)", "15 (2.2%)", 
"0 (0.0%)"), `Feb-21` = c("2099 (50.24%)", "14 (0.34%)", "2064 (49.40%)", 
"1 (0.02%)", "1702 (98.44%)", "19 (1.10%)", "7 (0.40%)", "1 (0.06%)"
), `Jan-21` = c("2084 (46.71%)", "20 (0.45%)", "2352 (52.71%)", 
"6 (0.13%)", "912 (99.35%)", "1 (0.11%)", "5 (0.54%)", "0 (0.00%)"
), `Mar-21` = c("1111 (33.5%)", "575 (17.3%)", "876 (26.4%)", 
"753 (22.7%)", "198 (7.8%)", "2239 (88.6%)", "1 (0.0%)", "90 (3.6%)"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

REQUIRED OUTPUT:

Type   Category  Feb-20         Mar-20        Apr-20       
    A      RTSF     50 (50.00%)    25 (25.00%)   30 (30.00%)
    A      PRRT     20 (20.00%)    30 (30.00%)   30 (30.00%)
    A      ATTR     15 (15.00%)    15 (15.00%)   30 (30.00%)
    A      STRF     15 (15.00%)    30 (30.00%)   10 (10.00%)
  Total    -        100 (100.00%)  100 (100.00%) 100 (100.00%)
    B      RTSF     50 (50.00%)    20 (20.00%)   25 (38.46%)
    B      PRRT     0 (0.00%)      40 (40.00%)   5 (7.69%)
    B      ATTR     0 (0.00%)      10 (10.00%)   25 (38.46%)
    B      STRF     50 (50.00%)    30 (30.00%)   10 (15.38%)
  Total    -        100 (100.00%)  100 (100.00%) 65 (100.00%)

petrucci4prez · Accepted Answer

You need some sort of string-parsing function that also handles the aggregation:

library(tidyverse)

data <- structure(list(Type = c("A", "A", "A", "A", 
"B", "B", "B", "B"), category = c("RTSF", 
"PRRT", "ATTR", "STRF", "RTSF", 
"PRRT", "ATTR", "STRF"), `Apr-21` = c("1544 (47.7%)", 
"330 (10.2%)", "771 (23.8%)", "589 (18.2%)", "60 (2.38%)", "2448 (97.30%)", 
"0 (0.00%)", "8 (0.32%)"), `Dec-20` = c("2547 (50.93%)", "15 (0.30%)", 
"2435 (48.69%)", "4 (0.08%)", "680 (97.8%)", "0 (0.0%)", "15 (2.2%)", 
"0 (0.0%)"), `Feb-21` = c("2099 (50.24%)", "14 (0.34%)", "2064 (49.40%)", 
"1 (0.02%)", "1702 (98.44%)", "19 (1.10%)", "7 (0.40%)", "1 (0.06%)"
), `Jan-21` = c("2084 (46.71%)", "20 (0.45%)", "2352 (52.71%)", 
"6 (0.13%)", "912 (99.35%)", "1 (0.11%)", "5 (0.54%)", "0 (0.00%)"
), `Mar-21` = c("1111 (33.5%)", "575 (17.3%)", "876 (26.4%)", 
"753 (22.7%)", "198 (7.8%)", "2239 (88.6%)", "1 (0.0%)", "90 (3.6%)"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

parse_field <- function(s) {
    res <- str_match(s, "([0-9]+) $([0-9]+\.[0-9]+)%$")
    x <- as.numeric(res[, 2])
    y <- as.numeric(res[, 3])
    sprintf("%d (%02.f%%)", sum(x), sum(y))
}

cnames <- data %>%
    select(-Type, -category) %>%
    names()


totals <- data %>%
    as_tibble() %>%
    group_by(Type) %>%
    summarize_at(cnames, parse_field) %>%
    mutate(tmpType = Type, category = "-", Type = "Total")

combined <- data %>%
    mutate(tmpType = Type) %>%
    bind_rows(totals) %>%
    group_by(tmpType) %>%
    arrange(Type, .by_group = TRUE) %>%
    ungroup() %>%
    select(-tmpType)

sorted_names <- cnames[order(as.Date(paste("01", cnames), "%d %b-%y"))]

select(combined, Type, category, sorted_names)

output:

# A tibble: 10 x 7
   Type  category `Dec-20`     `Jan-21`     `Feb-21`     `Mar-21`    `Apr-21`   
                                             
 1 A     RTSF     2547 (50.93… 2084 (46.71… 2099 (50.24… 1111 (33.5… 1544 (47.7…
 2 A     PRRT     15 (0.30%)   20 (0.45%)   14 (0.34%)   575 (17.3%) 330 (10.2%)
 3 A     ATTR     2435 (48.69… 2352 (52.71… 2064 (49.40… 876 (26.4%) 771 (23.8%)
 4 A     STRF     4 (0.08%)    6 (0.13%)    1 (0.02%)    753 (22.7%) 589 (18.2%)
 5 Total -        5001 (100%)  4462 (100%)  4178 (100%)  3315 (100%) 3234 (100%)
 6 B     RTSF     680 (97.8%)  912 (99.35%) 1702 (98.44… 198 (7.8%)  60 (2.38%) 
 7 B     PRRT     0 (0.0%)     1 (0.11%)    19 (1.10%)   2239 (88.6… 2448 (97.3…
 8 B     ATTR     15 (2.2%)    5 (0.54%)    7 (0.40%)    1 (0.0%)    0 (0.00%)  
 9 B     STRF     0 (0.0%)     0 (0.00%)    1 (0.06%)    90 (3.6%)   8 (0.32%)  
10 Total -        695 (100%)   918 (100%)   1729 (100%)  2528 (100%) 2516 (100%)

Explanation

Define parse function: parse_field will match the leading number and the number in the parentheses. Since str_match is vectorized and the argument s is a vector representing one of the columns, it will return a matrix which can then be summed and reformatted using sprintf.
Make a new dataframe with the totals: Take the original dataframe, group by the Type column, and aggregate within each group using parse_field for each column you care about. Change the category to a dash and Type to "Total" since that's how you want these displayed in the final output. Also make a new column holding the original type for step 3 called tmpType.
Combine totals with original data: Add a new column tmpType to hold the type analogously to what was done at the end of 2. Append the totals to the original data (at this point they are at the bottom so we need to sort). Group by tmpType and sort by Type, finally ungroup and remove the tmpType column.
Sort by column names: sort the column names of the original dataframe by date

Notes

The tmpType column is necessary because grouping on Type will also make Total its own group which will keep it at the bottom.
This works because the arrange will sort Total after A, B, etc. If you somehow have Types that go beyond T, you would need to make another column as the sort key column where the original data has all 0 and the totals data has all 1 (and sort 0 before 1 within each category).
The weird paste call in the line that sorts the names seems to be necessary since as.Date will return NA if you only give it a month and year. Appending 01 to each date basically just means parse the date at the beginning of each month, which seems fine for your case.

Adding Total column in R dataframe

Answers (2)

Related Questions