Reputation: 5

Calculating cumulative sum for multiple columns in R

R newb, I'm trying to calculate the cumulative sum grouped by year, month, group and subgroup, also having multiple columns to calculate.

Sample of the data:

df <- data.frame("Year"=2020,
                "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                "Group"=c("A","A","A","B","A","B","B","B"),
                "SubGroup"=c("a","a","b","b","a","b","a","b"),
                "V1"=c(10,10,20,20,50,50,10,10),
                "V2"=c(0,1,2,2,0,5,1,1))
    
       Year Month Group SubGroup V1 V2
    1 2020   Jan     A        a 10  0
    2 2020   Jan     A        a 10  1
    3 2020   Jan     A        b 20  2
    4 2020   Jan     B        b 20  2
    5 2020   Feb     A        a 50  0
    6 2020   Feb     B        b 50  5
    7 2020   Feb     B        a 10  1
    8 2020   Feb     B        b 10  1

Resulting Table wanted:

      Year Month Group SubGroup V1 V2
    1 2020   Jan     A        a 20  1
    2 2020   Feb     A        a 70  1
    3 2020   Jan     A        b 20  2
    4 2020   Feb     A        b 20  2
    5 2020   Jan     B        a  0  0
    6 2020   Feb     B        a 10  1
    7 2020   Jan     B        b 20  2
    8 2020   Feb     B        b 80  8

From Sample Table, on Jan 2020, the sum of Group 'A' Subgroup 'a' was 10+10 = 20... On Feb 2020, the value was 50, therefore 20 from Jan + 50 = 70, and so on...

If there is no value, it should consider 0.

I've tried few codes but none didn't get even close to the output I need. Would really appreciate if someone could help me with some tips for this problem.

Upvotes: 0

Answers (3)

Odysseus210

Reputation: 468

library(dplyr)
library(zoo)

df %>%
  arrange(as.yearmon(paste0(Year, '-', Month), '%Y-%b'), Group, SubGroup) %>%
  group_by(Year, Group, SubGroup) %>% 
  mutate(
         V1 = cumsum(V1),
         V2 = cumsum(V2)
       ) %>% 
  arrange(Year, Group, SubGroup, as.yearmon(paste0(Year, '-', Month), '%Y-%b')) #for desired output ordering

#  A tibble: 8 x 6
#  Groups:   Year, Group, SubGroup [4]
#   Year  Month Group SubGroup    V1    V2
#   <chr> <chr> <chr> <chr>    <dbl> <dbl>
# 1 2020  Jan   A     a           10     0
# 2 2020  Jan   A     a           20     1
# 3 2020  Feb   A     a           70     1
# 4 2020  Jan   A     b           20     2
# 5 2020  Feb   B     a           10     1
# 6 2020  Jan   B     b           20     2
# 7 2020  Feb   B     b           70     7
# 8 2020  Feb   B     b           80     8

Upvotes: 0

Charlie Gallagher

Reputation: 616

If I understand what you are doing, you're taking the sum for each month, then doing the cumulative sums for the months. This is usuaully pretty easy in dplyr.

library(dplyr)

df %>% 
  group_by(Year, Month, Group, SubGroup) %>% 
  summarize(
    V1_sum = sum(V1),
    V2_sum = sum(V2)
  ) %>% 
  group_by(Year, Group, SubGroup) %>% 
  mutate(
    V1_cumsum = cumsum(V1_sum),
    V2_cumsum = cumsum(V2_sum)
  )


# A tibble: 6 x 8
# Groups:   Year, Group, SubGroup [4]
#   Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
#   <dbl> <chr> <chr> <chr>     <dbl>  <dbl>     <dbl>     <dbl>
# 1  2020 Feb   A     a            50      0        50         0
# 2  2020 Feb   B     a            10      1        10         1
# 3  2020 Feb   B     b            60      6        60         6
# 4  2020 Jan   A     a            20      1        70         1
# 5  2020 Jan   A     b            20      2        20         2
# 6  2020 Jan   B     b            20      2        80         8

But you'll notice that the monthly cumulative sums are backwards (i.e. January comes after February), because by default group_by groups alphabetically. Also, you don't see the empty values because dplyr doesn't fill them in.

To fix the order of the months, you can either make your months numeric (convert to dates) or turn them into factors. You can add back 'missing' combinations of the grouping variables by using aggregate in base R instead of dplyr::summarize. aggregate includes all combinations of the grouping factors. aggregate converts the missing values to NA, but you can replace the NA with 0 with tidyr::replace_na, for example.

library(dplyr)
library(tidyr)

df <- data.frame("Year"=2020,
                 "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"),
                 "Group"=c("A","A","A","B","A","B","B","B"),
                 "SubGroup"=c("a","a","b","b","a","b","a","b"),
                 "V1"=c(10,10,20,20,50,50,10,10),
                 "V2"=c(0,1,2,2,0,5,1,1))

df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)

# Get monthly sums
df1 <- with(df, aggregate(
  list(V1_sum = V1, V2_sum = V2),
  list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
  FUN = sum, drop = FALSE
))

df1 <- df1 %>% 
  # Replace NA with 0
  mutate(
    V1_sum = replace_na(V1_sum, 0),
    V2_sum = replace_na(V2_sum, 0)
  ) %>% 
  # Get cumulative sum across months
  group_by(Year, Group, SubGroup) %>% 
  mutate(V1cumsum = cumsum(V1_sum), 
         V2cumsum = cumsum(V2_sum)) %>%
  ungroup() %>% 
  select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)

This gives the same result as your example:

# # A tibble: 8 x 6
#    Year Month Group SubGroup    V1    V2
#    <dbl> <ord> <chr> <chr>    <dbl> <dbl>
# 1  2020 Jan   A     a           20     1
# 2  2020 Feb   A     a           70     1
# 3  2020 Jan   B     a            0     0
# 4  2020 Feb   B     a           10     1
# 5  2020 Jan   A     b           20     2
# 6  2020 Feb   A     b           20     2
# 7  2020 Jan   B     b           20     2
# 8  2020 Feb   B     b           80     8

Upvotes: 1

Rui Barradas

Reputation: 76651

This is a simple group_by/mutate problem. The columns V1, V2 are chosen with across and cumsum applied to them.

df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

df %>%
  group_by(Year, Group, SubGroup) %>%
  mutate(across(V1:V2, ~cumsum(.x))) %>%
  ungroup() %>%
  arrange(Year, Group, SubGroup, Month)
## A tibble: 8 x 6
#  Year  Month Group SubGroup    V1    V2
#  <chr> <fct> <chr> <chr>    <dbl> <dbl>
#1 2020  Jan   A     a           10     0
#2 2020  Jan   A     a           20     1
#3 2020  Feb   A     a           70     1
#4 2020  Jan   A     b           20     2
#5 2020  Feb   B     a           10     1
#6 2020  Jan   B     b           20     2
#7 2020  Feb   B     b           70     7
#8 2020  Feb   B     b           80     8

Upvotes: 2

Calculating cumulative sum for multiple columns in R

Answers (3)

Related Questions