Reputation: 23
I have a massive dataset with a column "amount," which I want to groupby each time it goes above a cumulative-sum-threshold. However, my current solutions are too slow to make it feasible.
Example:
library(data.table)
dt <- data.table(amount = c(0.009288, 0.189386, 0.071265, 0.137387, 0.032742, 0.000003, 0.071265, 0.122247, 0.124920, 0.032743))
And with a threshold of 0.2, I'm looking for an "indicator" that should be:
such that I can use the indicator to group/select the observations by the cumulative sum up to and including the threshold observation.
I have tried looping over a "cumsum" column, and played around with trying to use a logical vector, but since I need to extract the next "starting point" after each cumsum, or reset the sum, it is too slow:
to_add <- 0.2
dt[, cumSum:=cumsum(amount)]
dt[, indicator:=as.numeric(NA)]
threshold <- copy(to_add)
for(i in 1:nrow(dt)) {
print(i)
if(threshold > max(dt$cumSum)) break
amount_t <- dt[cumSum >= threshold]$cumSum[1]
dt[cumSum <= amount_t & is.na(indicator), indicator:=i]
threshold <- amount_t + to_add
}
dt[is.na(indicator), indicator:=i]
Upvotes: 2
Views: 68
Reputation: 28825
In base
using Reduce
, we can put a condition on accumulating the rows, to find when the sum is going above threshold.
library(data.table)
dt[, `:=`(csum = cumsum(amount),
id = shift(cumsum(Reduce(function(lag_csum, r_val){
ifelse(lag_csum < 0.2, lag_csum + r_val , r_val)},
amount, accumulate = TRUE) > 0.2), fill = 0) + 1)][]
#> amount csum id
#> 1: 0.009288 0.009288 1
#> 2: 0.189386 0.198674 1
#> 3: 0.071265 0.269939 1
#> 4: 0.137387 0.407326 2
#> 5: 0.032742 0.440068 2
#> 6: 0.000003 0.440071 2
#> 7: 0.071265 0.511336 2
#> 8: 0.122247 0.633583 3
#> 9: 0.124920 0.758503 3
#> 10: 0.032743 0.791246 4
or Using purrr:accumulate
:
dt[, `:=`(csum = cumsum(amount),
id = shift(cumsum(purrr::accumulate(amount,
~ifelse(.x < 0.2, .x + .y, .y)) > 0.2),
fill = 0) + 1)][]
#> amount csum id
#> 1: 0.009288 0.009288 1
#> 2: 0.189386 0.198674 1
#> 3: 0.071265 0.269939 1
#> 4: 0.137387 0.407326 2
#> 5: 0.032742 0.440068 2
#> 6: 0.000003 0.440071 2
#> 7: 0.071265 0.511336 2
#> 8: 0.122247 0.633583 3
#> 9: 0.124920 0.758503 3
#> 10: 0.032743 0.791246 4
Upvotes: 3