Reputation: 23
I want to compute cumulative sum within 5 days for each group.
df <- data.frame(
date = ymd( c( "2022-01-02","2022-01-03","2022-01-05","2022-01-07","2022-01-11","2022-01-14","2022-01-17","2022-01-18","2022-01-24","2022-01-27","2022-01-01","2022-01-04","2022-01-04","2022-01-08","2022-01-12","2022-01-14","2022-01-19","2022-01-24","2022-01-25","2022-01-28")),
group = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B"),
number = c(10,30,20,50,30,50,40,50,30,50,55,10,30,20,50,30,40,30,40,30))
A small sample of my data frame is below including what the cumulative sum column should return. Any help would be appreciated. Thanks.
date group number cumsum(s)
2022-01-02 A 10 10
2022-01-03 A 30 40
2022-01-05 A 20 60
2022-01-07 A 50 110
2022-01-11 A 30 80
2022-01-14 A 50 80
2022-01-17 A 40 90
2022-01-18 A 50 140
2022-01-24 A 30 30
2022-01-27 A 50 80
2022-01-01 B 55 55
2022-01-04 B 10 65
2022-01-04 B 30 95
2022-01-08 B 20 60
2022-01-12 B 50 70
2022-01-14 B 30 80
2022-01-19 B 40 70
2022-01-24 B 30 70
2022-01-25 B 40 70
2022-01-28 B 30 100
I tried to use map() and cumsum() but failed.
Upvotes: 2
Views: 193
Reputation: 24722
You can join the data on itself, using a non-equi join in data.table
, and using .EACHI
to estimate the cumulative sum
library(data.table)
df = setDT(df)[, d:=date-5][]
cbind(
df[df,on=.(group,date<=date, date>=d), .(result = sum(number)), .EACHI][, .(group,date,result)],
df[, .(number)]
)
Output:
group date result number
1: A 2022-01-02 10 10
2: A 2022-01-03 40 30
3: A 2022-01-05 60 20
4: A 2022-01-07 110 50
5: A 2022-01-11 80 30
6: A 2022-01-14 80 50
7: A 2022-01-17 90 40
8: A 2022-01-18 140 50
9: A 2022-01-24 30 30
10: A 2022-01-27 80 50
11: B 2022-01-01 55 55
12: B 2022-01-04 95 10
13: B 2022-01-04 95 30
14: B 2022-01-08 60 20
15: B 2022-01-12 70 50
16: B 2022-01-14 80 30
17: B 2022-01-19 70 40
18: B 2022-01-24 70 30
19: B 2022-01-25 70 40
20: B 2022-01-28 100 30
Let's say we want to do this for multiple shift values. Another way to do this is to get the overall cumulative sum for each group, over the entire set of days (including days not represented), and use shift.
library(data.table)
setDT(df)
df = df[df[, data.table(date=seq(min(date), max(date),1)), group], on=.(group,date)][
,cs:=cumsum(fifelse(is.na(number),0,number)),group]
f <- function(s,cs) {
r = cs-shift(cs,s+1)
r[is.na(r)] <- cs[is.na(r)]
r
}
s = c(3,5,7,10)
df[, (paste0("c",s)):=lapply(s, f,cs=cs),group][!is.na(number)]
Output:
date group number cs c3 c5 c7 c10
<Date> <char> <num> <num> <num> <num> <num> <num>
1: 2022-01-02 A 10 10 10 10 10 10
2: 2022-01-03 A 30 40 40 40 40 40
3: 2022-01-05 A 20 60 60 60 60 60
4: 2022-01-07 A 50 110 70 110 110 110
5: 2022-01-11 A 30 140 30 80 100 140
6: 2022-01-14 A 50 190 80 80 130 150
7: 2022-01-17 A 40 230 90 90 120 170
8: 2022-01-18 A 50 280 90 140 170 170
9: 2022-01-24 A 30 310 30 30 120 170
10: 2022-01-27 A 50 360 80 80 80 170
11: 2022-01-01 B 55 55 55 55 55 55
12: 2022-01-04 B 10 65 65 65 65 65
13: 2022-01-04 B 30 95 40 95 95 95
14: 2022-01-08 B 20 115 20 60 60 115
15: 2022-01-12 B 50 165 50 70 70 110
16: 2022-01-14 B 30 195 80 80 100 130
17: 2022-01-19 B 40 235 40 70 120 120
18: 2022-01-24 B 30 265 30 70 70 100
19: 2022-01-25 B 40 305 70 70 110 110
20: 2022-01-28 B 30 335 70 100 100 140
date group number cs c3 c5 c7 c10
Upvotes: 5
Reputation: 173783
This gives the expected output:
library(tidyverse)
df %>%
group_by(group) %>%
mutate(cumsum = sapply(date, \(x) sum(number[date >= (x - 5) & date <= x])))
#> # A tibble: 20 x 4
#> # Groups: group [2]
#> date group number cumsum
#> <date> <chr> <dbl> <dbl>
#> 1 2022-01-02 A 10 10
#> 2 2022-01-03 A 30 40
#> 3 2022-01-05 A 20 60
#> 4 2022-01-07 A 50 110
#> 5 2022-01-11 A 30 80
#> 6 2022-01-14 A 50 80
#> 7 2022-01-17 A 40 90
#> 8 2022-01-18 A 50 140
#> 9 2022-01-24 A 30 30
#> 10 2022-01-27 A 50 80
#> 11 2022-01-01 B 55 55
#> 12 2022-01-04 B 10 95
#> 13 2022-01-04 B 30 95
#> 14 2022-01-08 B 20 60
#> 15 2022-01-12 B 50 70
#> 16 2022-01-14 B 30 80
#> 17 2022-01-19 B 40 70
#> 18 2022-01-24 B 30 70
#> 19 2022-01-25 B 40 70
#> 20 2022-01-28 B 30 100
Created on 2022-10-30 with reprex v2.0.2
Upvotes: 6
Reputation: 886928
We could get the diff
erence of date
to create a logical vector, get the cumulative sum and use that in grouping and get the cumsum
of 'number'
library(dplyr)
df %>%
group_by(group) %>%
mutate(difgrp =cumsum(c(TRUE, diff(date) >=4))) %>%
group_by(diffgrp, .add = TRUE) %>%
mutate(s = cumsum(number)) %>%
ungroup %>%
select(-diffgrp)
Upvotes: 3