lewkiz
lewkiz

Reputation: 23

How to do a rolling sum where the width depends on the sum itself?

I have a massive dataset with a column "amount," which I want to groupby each time it goes above a cumulative-sum-threshold. However, my current solutions are too slow to make it feasible.

Example:

library(data.table)
dt <- data.table(amount = c(0.009288, 0.189386, 0.071265, 0.137387, 0.032742, 0.000003, 0.071265, 0.122247, 0.124920, 0.032743))

And with a threshold of 0.2, I'm looking for an "indicator" that should be:

Expected output

such that I can use the indicator to group/select the observations by the cumulative sum up to and including the threshold observation.

I have tried looping over a "cumsum" column, and played around with trying to use a logical vector, but since I need to extract the next "starting point" after each cumsum, or reset the sum, it is too slow:

to_add <- 0.2
dt[, cumSum:=cumsum(amount)]
dt[, indicator:=as.numeric(NA)]
threshold <- copy(to_add)
for(i in 1:nrow(dt)) {
  print(i)
  if(threshold > max(dt$cumSum)) break
  amount_t <- dt[cumSum >= threshold]$cumSum[1]
  dt[cumSum <= amount_t & is.na(indicator), indicator:=i]
  threshold <- amount_t + to_add
}
dt[is.na(indicator), indicator:=i]

Upvotes: 2

Views: 68

Answers (1)

M--
M--

Reputation: 28825

In base using Reduce, we can put a condition on accumulating the rows, to find when the sum is going above threshold.

library(data.table)

dt[, `:=`(csum = cumsum(amount),
          id = shift(cumsum(Reduce(function(lag_csum, r_val){
                                      ifelse(lag_csum < 0.2, lag_csum + r_val , r_val)}, 
                    amount, accumulate = TRUE) > 0.2), fill = 0) + 1)][]
#>       amount     csum id
#>  1: 0.009288 0.009288  1
#>  2: 0.189386 0.198674  1
#>  3: 0.071265 0.269939  1
#>  4: 0.137387 0.407326  2
#>  5: 0.032742 0.440068  2
#>  6: 0.000003 0.440071  2
#>  7: 0.071265 0.511336  2
#>  8: 0.122247 0.633583  3
#>  9: 0.124920 0.758503  3
#> 10: 0.032743 0.791246  4

or Using purrr:accumulate:

dt[, `:=`(csum = cumsum(amount),
          id = shift(cumsum(purrr::accumulate(amount, 
                                              ~ifelse(.x < 0.2, .x + .y, .y)) > 0.2), 
                     fill = 0) + 1)][]
#>       amount     csum id
#>  1: 0.009288 0.009288  1
#>  2: 0.189386 0.198674  1
#>  3: 0.071265 0.269939  1
#>  4: 0.137387 0.407326  2
#>  5: 0.032742 0.440068  2
#>  6: 0.000003 0.440071  2
#>  7: 0.071265 0.511336  2
#>  8: 0.122247 0.633583  3
#>  9: 0.124920 0.758503  3
#> 10: 0.032743 0.791246  4

Upvotes: 3

Related Questions