Reputation: 111

How to create a cumulative sum that is grouped by 2 different columns

Say I have the following dataframe, initialized as df:

ID     date     value
A      01/2012  1
A      03/2012  2
A      05/2012  4
A      07/2012  3
A      09/2012  7
A      11/2012  1
A      01/2013  2
A      03/2013  8
A      05/2013  13
A      07/2013  2
A      09/2013  5
A      11/2013  2
B      01/2012  3
B      03/2012  9
B      05/2012  1
B      07/2012  0
B      09/2012  12
B      11/2012  3
B      01/2013  1
B      03/2013  4
B      05/2013  3
B      07/2013  3
B      09/2013  1
B      11/2013  1

Where the date variable is of the form mm/yyyy. I am looking for a way to create an additional column in this data frame that gives cumulative sum of the value column grouped by ID and year. For example, the following output is what I want:

ID     date     value  cumsum
A      01/2012  1      1
A      03/2012  2      3
A      05/2012  4      7
A      07/2012  3      10
A      09/2012  7      17
A      11/2012  1      18
A      01/2013  2      2
A      03/2013  8      10
A      05/2013  13     23
A      07/2013  2      25
A      09/2013  5      30
A      11/2013  2      32
B      01/2012  3      3
B      03/2012  9      12
B      05/2012  1      13
B      07/2012  0      13
B      09/2012  12     25
B      11/2012  3      28
B      01/2013  1      1
B      03/2013  4      5
B      05/2013  3      8
B      07/2013  3      11
B      09/2013  1      12
B      11/2013  1      13

As you can see, the sum resets for each year but also for each ID. Essentially I am unsure of how to create a cumulative sum that is grouped by 2 columns instead of just 1. Any help would be appreciated

Upvotes: 0

Answers (4)

Ronak Shah

Reputation: 389335

For completion here is a base R option -

transform(df, cumsum = ave(value, ID, sub('.*/', '', date), FUN = cumsum))

#   ID    date value cumsum
#1   A 01/2012     1      1
#2   A 03/2012     2      3
#3   A 05/2012     4      7
#4   A 07/2012     3     10
#5   A 09/2012     7     17
#6   A 11/2012     1     18
#7   A 01/2013     2      2
#8   A 03/2013     8     10
#9   A 05/2013    13     23
#10  A 07/2013     2     25
#11  A 09/2013     5     30
#12  A 11/2013     2     32
#13  B 01/2012     3      3
#14  B 03/2012     9     12
#15  B 05/2012     1     13
#16  B 07/2012     0     13
#17  B 09/2012    12     25
#18  B 11/2012     3     28
#19  B 01/2013     1      1
#20  B 03/2013     4      5
#21  B 05/2013     3      8
#22  B 07/2013     3     11
#23  B 09/2013     1     12
#24  B 11/2013     1     13

Upvotes: 1

Peace Wang

Reputation: 2419

Optional solution:

library(data.tabe)
setDT(df)[,cumsum:=cumsum(value),by=.(ID,substr(date, 4,8))][]

Upvotes: 2

akrun

Reputation: 887981

We convert to Date class and extract the year part to create grouping column before we do the cumulative sum

library(dplyr)
library(lubridate)
df1 %>% 
    group_by(ID, year = year(my(date))) %>%
     mutate(cumsum = cumsum(value)) %>% 
     ungroup %>% 
     select(-year)
# A tibble: 24 x 4
   ID    date    value cumsum
   <chr> <chr>   <int>  <int>
 1 A     01/2012     1      1
 2 A     03/2012     2      3
 3 A     05/2012     4      7
 4 A     07/2012     3     10
 5 A     09/2012     7     17
 6 A     11/2012     1     18
 7 A     01/2013     2      2
 8 A     03/2013     8     10
 9 A     05/2013    13     23
10 A     07/2013     2     25
# … with 14 more rows

data

df1 <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B"), date = c("01/2012", "03/2012", "05/2012", "07/2012", 
"09/2012", "11/2012", "01/2013", "03/2013", "05/2013", "07/2013", 
"09/2013", "11/2013", "01/2012", "03/2012", "05/2012", "07/2012", 
"09/2012", "11/2012", "01/2013", "03/2013", "05/2013", "07/2013", 
"09/2013", "11/2013"), value = c(1L, 2L, 4L, 3L, 7L, 1L, 2L, 
8L, 13L, 2L, 5L, 2L, 3L, 9L, 1L, 0L, 12L, 3L, 1L, 4L, 3L, 3L, 
1L, 1L)), class = "data.frame", row.names = c(NA, -24L))

Upvotes: 0

Onyambu

Reputation: 79358

You could do:

df %>%
  group_by(ID, year = substr(date, 4,8)) %>%
  mutate(cumsum = cumsum(value))

# A tibble: 24 x 5
# Groups:   ID, year [4]
   ID    date    value year  cumsum
   <chr> <chr>   <int> <chr>  <int>
 1 A     01/2012     1 2012       1
 2 A     03/2012     2 2012       3
 3 A     05/2012     4 2012       7
 4 A     07/2012     3 2012      10
 5 A     09/2012     7 2012      17
 6 A     11/2012     1 2012      18
 7 A     01/2013     2 2013       2
 8 A     03/2013     8 2013      10
 9 A     05/2013    13 2013      23
10 A     07/2013     2 2013      25

Upvotes: 0

How to create a cumulative sum that is grouped by 2 different columns

Answers (4)

data

Related Questions