Dynamic Cumsum with condition

Question

I have the following problem I am trying to solve. Here some example data:

library(tidyverse)
library(lubridate)

date <- data.frame(date=seq(ymd('2018-01-01'),ymd('2018-02-28'), by = '1 day'))
group <- data.frame(group=c("A","B"))
subgroup <- data.frame(subgroup=c("C","D"))

DF <- merge(merge(date,group,by=NULL),subgroup,by=NULL)
DF$group_value <- apply(DF, 1, function(x) sample(8:12,1))
DF$subgroup_value <- apply(DF, 1, function(x) sample(1:5,1))

DF <- DF %>%
  arrange(date,group,subgroup)

I now want to calculate the following:

for every given day t, group and subgroup combination calculate the number of days until the (backward) cumsum of subgroup_value is equal or greater than the group value of day t.

I know how to do that by using some for loops and some dplyr functionality, but this is just terrible slow:

for(i in seq(1,nrow(date),1)) {
  for(j in seq(1,nrow(group),1)) {
    for(k in seq(1,nrow(subgroup),1)) {
      tmp <- DF %>%
        filter(date<=date[i] & group == group[j] & subgroup == subgroup[k]) %>%
        arrange(desc(date))

      tmp$helper <- 1

      tmp <- tmp %>%
        mutate(
          cs_helper = cumsum(helper),
          cs_subgroup_value = cumsum(subgroup_value),
          nr_days = case_when (
            cs_subgroup_value >= group_value ~ cs_helper,
            TRUE ~ NA_real_)
        )

      #this is the final result for date[i], group[j], subgroup[k]
      value <- min(tmp[,"nr_days"], na.rm=T)
    }
  }
}

Example

head(DF,10)
         date group subgroup group_value subgroup_value result
1  2018-01-01     A        C          12              2     NA
2  2018-01-02     A        C          11              4     NA
3  2018-01-03     A        C          11              4     NA
4  2018-01-04     A        C           9              5      2
5  2018-01-05     A        C          12              5      3
6  2018-01-06     A        C          10              3      3
7  2018-01-07     A        C          12              5      3
8  2018-01-08     A        C           8              1      3
9  2018-01-09     A        C          12              4      4
10 2018-01-10     A        C           9              1      4

So for row 10, I need to sum the last 4 values of subgroup to be greater or equal to 9.

I am sure that this code can be highly optimized by using some vectorized version but I am struggle to find a good starting point for that (As you can see from the code above, I am a newbie in R)

My question is: How would you approach this problem in order to vectorize it for speed optimisation?

Thanks! Stephan

Tom Jemmett · Accepted Answer

Here's an attempt, take a copy of each group/subgroups data frame, and cross join to the data. This is then filtered to only find the days before. This allows us for each day to calculate all of the cumulative sums

DF %>%
  group_by(group, subgroup) %>%
  mutate(day = row_number(), J = TRUE) %>%
  nest() %>%
  arrange(group, subgroup) %>%
  mutate(data = map(data, function(d) {
    inner_join(d, transmute(d, x = day, v = subgroup_value, J), by = "J") %>%
      filter(day >= x) %>%
      mutate(x = day - x + 1) %>%
      arrange(day, x) %>%
      group_by(date, group_value, date, subgroup_value) %>%
      mutate(vv = cumsum(v),
             vv = ifelse(vv >= group_value, vv, NA),
             xx = ifelse(!is.na(vv), x, NA)) %>%
      group_by(date, group_value, day, subgroup_value) %>%
      summarise(x = min(xx, na.rm = TRUE), v = min(vv, na.rm = TRUE))
  })) %>%
  unnest()

Dynamic Cumsum with condition

Answers (1)

Related Questions