r.bot
r.bot

Reputation: 5424

Cumulative sum by rolling window of time

I have data on spells with start and end dates, by a person identifier.

temp <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1", 
"2"), class = "factor"), spell = c(1L, 2L, 3L, 1L, 2L, 3L), date1 = structure(c(14611, 
14654, 15141, 14853, 14867, 14975), class = "Date"), date2 = structure(c(14612, 
14656, 15142, 14862, 14872, 14976), class = "Date")), class = "data.frame", .Names = c("id", 
"spell", "date1", "date2"), row.names = c(NA, -6L))

I would like to calculate a rolling total of the days in spells within the last year (defined by a 365 day window), so that each row gets a total number of days that are included in the spells, prior to the current, within one year.

So in my example, person 1 has 3 spells, two of which occurred within a one year window (rows 1 and 2). The first spell has no prior and therefore the days.observed.in.past.yr is 0. The second has a prior spell (row 1), of 1 days length, so days.observed.in.past.yr is 0.

Person 2 has three spells, all occurring within one year. For spell 2, person 2, the prior spell was nine days long and so days.observed.in.past.yr is 9. Then for spell three, there are two prior spells and days.observed.in.past.yr is 14 (9+5). This would continue adding for as many spells there are within a 365 day window for the current row.

id  spell   date1       date2       days.observed.in.past.yr
1     1     2010-01-02 2010-01-03   0
1     2     2010-02-14 2010-02-16   1
1     3     2011-06-16 2011-06-17   0
2     1     2010-09-01 2010-09-10   0
2     2     2010-09-15 2010-09-20   9
2     3     2011-01-01 2011-01-02   14

However, beyond calculating the trivial things like spell length and the date one year earlier I've got no idea how to do this. The nearest similar problem that I've found is rgolf:rolling window but I don't know how to apply this to my problem.

Can anyone help?

Upvotes: 1

Views: 3144

Answers (2)

GoGonzo
GoGonzo

Reputation: 2877

This might be possible to achieve with runner package. OP required rolling sum over last 365 days excluding current observation. sum_run calculate sum of date2 - date1 in 365 days period. To exclude current row one can substract - spell_days or use a lag = 1 (exclude current day).

library(dplyr)
library(runner)

temp %>%
  group_by(id) %>%
  mutate(

    spell_days = date2 - date1,

    days.observed.in.past.yr = sum_run(
      x = spell_days, 
      k = 365, 
      idx = date1
    ) - spell_days

  )

#   id    spell date1      date2      spell_days days.observed.in.past.yr
#   <fct> <int> <date>     <date>     <drtn>     <drtn>                  
# 1 1         1 2010-01-02 2010-01-03 1 days      0 days                 
# 2 1         2 2010-02-14 2010-02-16 2 days      1 days                 
# 3 1         3 2011-06-16 2011-06-17 1 days      0 days                 
# 4 2         1 2010-09-01 2010-09-10 9 days      0 days                 
# 5 2         2 2010-09-15 2010-09-20 5 days      9 days                 
# 6 2         3 2011-01-01 2011-01-02 1 days     14 days 

Upvotes: 0

talat
talat

Reputation: 70336

Here's one way to do that using dplyr:

require(dplyr)

temp %>%
  mutate(year1 = format(date1, "%Y"),
         year2 = format(date2, "%Y")) %>%
  group_by(id) %>%
  mutate(count = ifelse(lag(year1, 1, default = 0) == year1, lag(date2, 1) - lag(date1,1), 0 )) %>%
  select(-c(year1, year2))

#Source: local data frame [5 x 4]
#Groups: id
#
#  id      date1      date2 count
#1  1 2010-01-02 2010-01-03     0
#2  1 2010-02-14 2010-02-16     1
#3  1 2011-06-16 2011-06-17     0
#4  2 2010-09-01 2010-09-10     0
#5  2 2010-09-15 2010-09-20     9

Edit 1 after comment

To do this using a 365 day "rolling" window, you can use the following:

temp %>%
  group_by(id) %>%
  mutate(count = ifelse(date1 - lag(date1, 1, default = 0) <= 365, lag(date2, 1) - lag(date1,1), 0))

The result is the same as above for the sample data.

Edit 2

I thought about this again and had some doubts that the first edit actually functions as expected since it only looks into the previous row to check if that row is less than 365 days before the current date1. So I came up with another version which looks into all rows per id, creates groups of data that is within a 365 day window and then sums up the date differences - maybe this is what you want.

df %>%
  group_by(id) %>%
  arrange(id, date1) %>%
  mutate(delta = floor(c(0, diff(date1)) / 365),
         delta = cumsum(delta)) %>%
  group_by(delta, add = TRUE) %>%
  mutate(count = cumsum(as.numeric(date2-date1)) - (date2 - date1)) %>%
  ungroup() %>%
  select(-delta)

The question that is not really clear to me is whether you only want to sum up date differences in the 1 previous row (if it exists and it is less than 365 days earlier, in that case edit 1 should work) or do you want to sum up all previous differences that are less than 365 days earlier (in that case edit 2 should work).

Upvotes: 2

Related Questions