Sophia Wilson
Sophia Wilson

Reputation: 477

Adding Total column in R dataframe

I have following Dataframe in R.

Type   Category     Mar-20       Apr-20       Feb-20
A      RTSF         25 (25.00%)  30 (30.00%)  50 (50.00%)
A      PRRT         30 (30.00%)  30 (30.00%)  20 (20.00%)
A      ATTR         15 (15.00%)  30 (30.00%)  15 (15.00%)
A      STRF         30 (30.00%)  10 (10.00%)  15 (15.00%)
B      RTSF         20 (20.00%)  25 (38.46%)  50 (50.00%)
B      PRRT         40 (40.00%)  5 (7.69%)    0 (0.00%)
B      ATTR         10 (10.00%)  25 (38.46%)  0 (0.00%)
B      STRF         30 (30.00%)  10 (15.38%)  50 (50.00%)

I want to add the Total column for every Type and arrange the Months in the ascending order from left to right.

Dput(df)<-

structure(list(Type = c("A", "A", "A", "A", 
"B", "B", "B", "B"), category = c("RTSF", 
"PRRT", "ATTR", "STRF", "RTSF", 
"PRRT", "ATTR", "STRF"), `Apr-21` = c("1544 (47.7%)", 
"330 (10.2%)", "771 (23.8%)", "589 (18.2%)", "60 (2.38%)", "2448 (97.30%)", 
"0 (0.00%)", "8 (0.32%)"), `Dec-20` = c("2547 (50.93%)", "15 (0.30%)", 
"2435 (48.69%)", "4 (0.08%)", "680 (97.8%)", "0 (0.0%)", "15 (2.2%)", 
"0 (0.0%)"), `Feb-21` = c("2099 (50.24%)", "14 (0.34%)", "2064 (49.40%)", 
"1 (0.02%)", "1702 (98.44%)", "19 (1.10%)", "7 (0.40%)", "1 (0.06%)"
), `Jan-21` = c("2084 (46.71%)", "20 (0.45%)", "2352 (52.71%)", 
"6 (0.13%)", "912 (99.35%)", "1 (0.11%)", "5 (0.54%)", "0 (0.00%)"
), `Mar-21` = c("1111 (33.5%)", "575 (17.3%)", "876 (26.4%)", 
"753 (22.7%)", "198 (7.8%)", "2239 (88.6%)", "1 (0.0%)", "90 (3.6%)"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

REQUIRED OUTPUT:

Type   Category  Feb-20         Mar-20        Apr-20       
    A      RTSF     50 (50.00%)    25 (25.00%)   30 (30.00%)
    A      PRRT     20 (20.00%)    30 (30.00%)   30 (30.00%)
    A      ATTR     15 (15.00%)    15 (15.00%)   30 (30.00%)
    A      STRF     15 (15.00%)    30 (30.00%)   10 (10.00%)
  Total    -        100 (100.00%)  100 (100.00%) 100 (100.00%)
    B      RTSF     50 (50.00%)    20 (20.00%)   25 (38.46%)
    B      PRRT     0 (0.00%)      40 (40.00%)   5 (7.69%)
    B      ATTR     0 (0.00%)      10 (10.00%)   25 (38.46%)
    B      STRF     50 (50.00%)    30 (30.00%)   10 (15.38%)
  Total    -        100 (100.00%)  100 (100.00%) 65 (100.00%)

Upvotes: 1

Views: 417

Answers (2)

TarJae
TarJae

Reputation: 78927

Here is another working solution:

Logic:

  1. clean_names with janitor package
  2. get long format to separate counts from percent
  3. back to wide format
  4. prepare for map_df from purrr with adorn_totals
  5. adorn_totals
  6. tweak colnames
library(tidyverse)
library(janitor)
df %>% 
  clean_names() %>% 
  pivot_longer(
    cols = 3:7,
    names_to = "Date",
    values_to = "Values"
  ) %>% 
  separate(Values, c("Values_N","Values_Percent"), sep=" ") %>% 
  pivot_wider(
    names_from = Date,
    values_from = c(Values_N, Values_Percent) 
  ) %>% 
  type.convert(as.is=TRUE) %>% 
  split( .[,"type"] ) %>%
  purrr::map_df(., janitor::adorn_totals) %>% 
  select(Type=type, Category=category, 
         `Dez-20_Count`=Values_N_dec_20,
         `Dez-20_Percent`=Values_Percent_dec_20,
         `Jan-21_Count`=Values_N_jan_21,
         `Jan-21_Percent`=Values_Percent_jan_21,
         `Feb-21_Count`=Values_N_feb_21,
         `Feb-21_Percent`=Values_Percent_feb_21,
         `Mar-21_Count`=Values_N_mar_21,
         `Mar-21_Percent`=Values_Percent_mar_21,
         `Apr-21_Count`=Values_N_apr_21,
         `Apr-21_Percent`=Values_Percent_apr_21)

Output:

  Type Category Dez-20_Count Dez-20_Percent Jan-21_Count Jan-21_Percent Feb-21_Count Feb-21_Percent Mar-21_Count Mar-21_Percent Apr-21_Count Apr-21_Percent
     A     RTSF         2547       (50.93%)         2084       (46.71%)         2099       (50.24%)         1111        (33.5%)         1544        (47.7%)
     A     PRRT           15        (0.30%)           20        (0.45%)           14        (0.34%)          575        (17.3%)          330        (10.2%)
     A     ATTR         2435       (48.69%)         2352       (52.71%)         2064       (49.40%)          876        (26.4%)          771        (23.8%)
     A     STRF            4        (0.08%)            6        (0.13%)            1        (0.02%)          753        (22.7%)          589        (18.2%)
 Total        -         5001              -         4462              -         4178              -         3315              -         3234              -
     B     RTSF          680        (97.8%)          912       (99.35%)         1702       (98.44%)          198         (7.8%)           60        (2.38%)
     B     PRRT            0         (0.0%)            1        (0.11%)           19        (1.10%)         2239        (88.6%)         2448       (97.30%)
     B     ATTR           15         (2.2%)            5        (0.54%)            7        (0.40%)            1         (0.0%)            0        (0.00%)
     B     STRF            0         (0.0%)            0        (0.00%)            1        (0.06%)           90         (3.6%)            8        (0.32%)
 Total        -          695              -          918              -         1729              -         2528              -         2516              -

Upvotes: 0

petrucci4prez
petrucci4prez

Reputation: 508

You need some sort of string-parsing function that also handles the aggregation:

library(tidyverse)

data <- structure(list(Type = c("A", "A", "A", "A", 
"B", "B", "B", "B"), category = c("RTSF", 
"PRRT", "ATTR", "STRF", "RTSF", 
"PRRT", "ATTR", "STRF"), `Apr-21` = c("1544 (47.7%)", 
"330 (10.2%)", "771 (23.8%)", "589 (18.2%)", "60 (2.38%)", "2448 (97.30%)", 
"0 (0.00%)", "8 (0.32%)"), `Dec-20` = c("2547 (50.93%)", "15 (0.30%)", 
"2435 (48.69%)", "4 (0.08%)", "680 (97.8%)", "0 (0.0%)", "15 (2.2%)", 
"0 (0.0%)"), `Feb-21` = c("2099 (50.24%)", "14 (0.34%)", "2064 (49.40%)", 
"1 (0.02%)", "1702 (98.44%)", "19 (1.10%)", "7 (0.40%)", "1 (0.06%)"
), `Jan-21` = c("2084 (46.71%)", "20 (0.45%)", "2352 (52.71%)", 
"6 (0.13%)", "912 (99.35%)", "1 (0.11%)", "5 (0.54%)", "0 (0.00%)"
), `Mar-21` = c("1111 (33.5%)", "575 (17.3%)", "876 (26.4%)", 
"753 (22.7%)", "198 (7.8%)", "2239 (88.6%)", "1 (0.0%)", "90 (3.6%)"
)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

parse_field <- function(s) {
    res <- str_match(s, "([0-9]+) \\(([0-9]+\\.[0-9]+)%\\)")
    x <- as.numeric(res[, 2])
    y <- as.numeric(res[, 3])
    sprintf("%d (%02.f%%)", sum(x), sum(y))
}

cnames <- data %>%
    select(-Type, -category) %>%
    names()


totals <- data %>%
    as_tibble() %>%
    group_by(Type) %>%
    summarize_at(cnames, parse_field) %>%
    mutate(tmpType = Type, category = "-", Type = "Total")

combined <- data %>%
    mutate(tmpType = Type) %>%
    bind_rows(totals) %>%
    group_by(tmpType) %>%
    arrange(Type, .by_group = TRUE) %>%
    ungroup() %>%
    select(-tmpType)

sorted_names <- cnames[order(as.Date(paste("01", cnames), "%d %b-%y"))]

select(combined, Type, category, sorted_names)

output:

# A tibble: 10 x 7
   Type  category `Dec-20`     `Jan-21`     `Feb-21`     `Mar-21`    `Apr-21`   
   <chr> <chr>    <chr>        <chr>        <chr>        <chr>       <chr>      
 1 A     RTSF     2547 (50.93… 2084 (46.71… 2099 (50.24… 1111 (33.5… 1544 (47.7…
 2 A     PRRT     15 (0.30%)   20 (0.45%)   14 (0.34%)   575 (17.3%) 330 (10.2%)
 3 A     ATTR     2435 (48.69… 2352 (52.71… 2064 (49.40… 876 (26.4%) 771 (23.8%)
 4 A     STRF     4 (0.08%)    6 (0.13%)    1 (0.02%)    753 (22.7%) 589 (18.2%)
 5 Total -        5001 (100%)  4462 (100%)  4178 (100%)  3315 (100%) 3234 (100%)
 6 B     RTSF     680 (97.8%)  912 (99.35%) 1702 (98.44… 198 (7.8%)  60 (2.38%) 
 7 B     PRRT     0 (0.0%)     1 (0.11%)    19 (1.10%)   2239 (88.6… 2448 (97.3…
 8 B     ATTR     15 (2.2%)    5 (0.54%)    7 (0.40%)    1 (0.0%)    0 (0.00%)  
 9 B     STRF     0 (0.0%)     0 (0.00%)    1 (0.06%)    90 (3.6%)   8 (0.32%)  
10 Total -        695 (100%)   918 (100%)   1729 (100%)  2528 (100%) 2516 (100%)

Explanation

  1. Define parse function: parse_field will match the leading number and the number in the parentheses. Since str_match is vectorized and the argument s is a vector representing one of the columns, it will return a matrix which can then be summed and reformatted using sprintf.
  2. Make a new dataframe with the totals: Take the original dataframe, group by the Type column, and aggregate within each group using parse_field for each column you care about. Change the category to a dash and Type to "Total" since that's how you want these displayed in the final output. Also make a new column holding the original type for step 3 called tmpType.
  3. Combine totals with original data: Add a new column tmpType to hold the type analogously to what was done at the end of 2. Append the totals to the original data (at this point they are at the bottom so we need to sort). Group by tmpType and sort by Type, finally ungroup and remove the tmpType column.
  4. Sort by column names: sort the column names of the original dataframe by date

Notes

  • The tmpType column is necessary because grouping on Type will also make Total its own group which will keep it at the bottom.
  • This works because the arrange will sort Total after A, B, etc. If you somehow have Types that go beyond T, you would need to make another column as the sort key column where the original data has all 0 and the totals data has all 1 (and sort 0 before 1 within each category).
  • The weird paste call in the line that sorts the names seems to be necessary since as.Date will return NA if you only give it a month and year. Appending 01 to each date basically just means parse the date at the beginning of each month, which seems fine for your case.

Upvotes: 2

Related Questions