Mattew Noh
Mattew Noh

Reputation: 23

Conditional cumulative sum associated with the date in R

I want to compute cumulative sum within 5 days for each group.

df <- data.frame(
  date = ymd( c( "2022-01-02","2022-01-03","2022-01-05","2022-01-07","2022-01-11","2022-01-14","2022-01-17","2022-01-18","2022-01-24","2022-01-27","2022-01-01","2022-01-04","2022-01-04","2022-01-08","2022-01-12","2022-01-14","2022-01-19","2022-01-24","2022-01-25","2022-01-28")),
  group = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B"),
  number = c(10,30,20,50,30,50,40,50,30,50,55,10,30,20,50,30,40,30,40,30))

A small sample of my data frame is below including what the cumulative sum column should return. Any help would be appreciated. Thanks.

date       group number cumsum(s)
2022-01-02 A     10     10
2022-01-03 A     30     40
2022-01-05 A     20     60
2022-01-07 A     50     110
2022-01-11 A     30     80
2022-01-14 A     50     80
2022-01-17 A     40     90
2022-01-18 A     50     140
2022-01-24 A     30     30
2022-01-27 A     50     80
2022-01-01 B     55     55
2022-01-04 B     10     65
2022-01-04 B     30     95
2022-01-08 B     20     60
2022-01-12 B     50     70
2022-01-14 B     30     80
2022-01-19 B     40     70
2022-01-24 B     30     70
2022-01-25 B     40     70
2022-01-28 B     30     100

I tried to use map() and cumsum() but failed.

Upvotes: 2

Views: 193

Answers (3)

langtang
langtang

Reputation: 24722

You can join the data on itself, using a non-equi join in data.table, and using .EACHI to estimate the cumulative sum

library(data.table)

df = setDT(df)[, d:=date-5][]

cbind(
  df[df,on=.(group,date<=date, date>=d), .(result = sum(number)), .EACHI][, .(group,date,result)],
  df[, .(number)]
)

Output:

    group       date result number
 1:     A 2022-01-02     10     10
 2:     A 2022-01-03     40     30
 3:     A 2022-01-05     60     20
 4:     A 2022-01-07    110     50
 5:     A 2022-01-11     80     30
 6:     A 2022-01-14     80     50
 7:     A 2022-01-17     90     40
 8:     A 2022-01-18    140     50
 9:     A 2022-01-24     30     30
10:     A 2022-01-27     80     50
11:     B 2022-01-01     55     55
12:     B 2022-01-04     95     10
13:     B 2022-01-04     95     30
14:     B 2022-01-08     60     20
15:     B 2022-01-12     70     50
16:     B 2022-01-14     80     30
17:     B 2022-01-19     70     40
18:     B 2022-01-24     70     30
19:     B 2022-01-25     70     40
20:     B 2022-01-28    100     30

Update - alternative approach

Let's say we want to do this for multiple shift values. Another way to do this is to get the overall cumulative sum for each group, over the entire set of days (including days not represented), and use shift.

  1. Load data.table and set the table using `setDT()
library(data.table)
setDT(df)
  1. Expand each group to min and max date, and get overall cumulative sum, per group
df = df[df[, data.table(date=seq(min(date), max(date),1)), group], on=.(group,date)][
  ,cs:=cumsum(fifelse(is.na(number),0,number)),group]
  1. Now, create a helper function, f, that takes a shift value (say 5, or 10) and returns the cumulative sum for that shift
f <- function(s,cs) {
  r = cs-shift(cs,s+1)
  r[is.na(r)] <- cs[is.na(r)]
  r
}
  1. Set a vector of your desired shifts, and apply the function to each of those shift values
s = c(3,5,7,10)
df[, (paste0("c",s)):=lapply(s, f,cs=cs),group][!is.na(number)]

Output:

          date  group number    cs    c3    c5    c7   c10
        <Date> <char>  <num> <num> <num> <num> <num> <num>
 1: 2022-01-02      A     10    10    10    10    10    10
 2: 2022-01-03      A     30    40    40    40    40    40
 3: 2022-01-05      A     20    60    60    60    60    60
 4: 2022-01-07      A     50   110    70   110   110   110
 5: 2022-01-11      A     30   140    30    80   100   140
 6: 2022-01-14      A     50   190    80    80   130   150
 7: 2022-01-17      A     40   230    90    90   120   170
 8: 2022-01-18      A     50   280    90   140   170   170
 9: 2022-01-24      A     30   310    30    30   120   170
10: 2022-01-27      A     50   360    80    80    80   170
11: 2022-01-01      B     55    55    55    55    55    55
12: 2022-01-04      B     10    65    65    65    65    65
13: 2022-01-04      B     30    95    40    95    95    95
14: 2022-01-08      B     20   115    20    60    60   115
15: 2022-01-12      B     50   165    50    70    70   110
16: 2022-01-14      B     30   195    80    80   100   130
17: 2022-01-19      B     40   235    40    70   120   120
18: 2022-01-24      B     30   265    30    70    70   100
19: 2022-01-25      B     40   305    70    70   110   110
20: 2022-01-28      B     30   335    70   100   100   140
          date  group number    cs    c3    c5    c7   c10

Upvotes: 5

Allan Cameron
Allan Cameron

Reputation: 173783

This gives the expected output:

library(tidyverse)

df %>%
  group_by(group) %>%
  mutate(cumsum = sapply(date, \(x) sum(number[date >= (x - 5) & date <= x])))
#> # A tibble: 20 x 4
#> # Groups:   group [2]
#>    date       group number cumsum
#>    <date>     <chr>  <dbl>  <dbl>
#>  1 2022-01-02 A         10     10
#>  2 2022-01-03 A         30     40
#>  3 2022-01-05 A         20     60
#>  4 2022-01-07 A         50    110
#>  5 2022-01-11 A         30     80
#>  6 2022-01-14 A         50     80
#>  7 2022-01-17 A         40     90
#>  8 2022-01-18 A         50    140
#>  9 2022-01-24 A         30     30
#> 10 2022-01-27 A         50     80
#> 11 2022-01-01 B         55     55
#> 12 2022-01-04 B         10     95
#> 13 2022-01-04 B         30     95
#> 14 2022-01-08 B         20     60
#> 15 2022-01-12 B         50     70
#> 16 2022-01-14 B         30     80
#> 17 2022-01-19 B         40     70
#> 18 2022-01-24 B         30     70
#> 19 2022-01-25 B         40     70
#> 20 2022-01-28 B         30    100

Created on 2022-10-30 with reprex v2.0.2

Upvotes: 6

akrun
akrun

Reputation: 886928

We could get the difference of date to create a logical vector, get the cumulative sum and use that in grouping and get the cumsum of 'number'

library(dplyr)
df %>%
   group_by(group) %>%
   mutate(difgrp =cumsum(c(TRUE, diff(date) >=4))) %>% 
   group_by(diffgrp, .add = TRUE) %>% 
   mutate(s = cumsum(number)) %>% 
   ungroup %>% 
   select(-diffgrp)

Upvotes: 3

Related Questions