dplyr reset counter when the threshold is reached

Question

I have following tibble called test:

  datetime                volume
                     
  2020-08-25 09:30:00.000      0
  2020-08-25 09:30:12.000    107
  2020-08-25 09:30:50.000    221
  2020-08-25 09:30:50.000    132
  2020-08-25 09:30:50.000    148
  2020-08-25 09:30:50.000    100
  2020-08-25 09:30:50.000    100
  2020-08-25 09:30:58.000    100
  2020-08-25 09:31:56.000    157
  2020-08-25 09:32:36.000    288
  2020-08-25 09:32:36.000    100
  2020-08-25 09:33:10.000    235
  2020-08-25 09:33:23.000    182
  2020-08-25 09:33:44.000    218
  2020-08-25 09:33:44.000    179
  2020-08-25 09:34:18.000    318
  2020-08-25 09:34:27.000    101
  2020-08-25 09:34:27.000    157
  2020-08-25 09:34:27.000    200
  2020-08-25 09:34:27.000    114

I wanted to calculate the cumulative time difference (or even the just the number of rows where the timestamps are the same) when a threshold for the volume is reached. Once the threshold is reached/surpassed, I reset the counter to 0, and accumulate from that point onward again.

For example, if my threshold is 300, I accumulate from row 1 to row 3, I'd get 0+107+221=328, I would now choose to:

retain the timestamp of this row,
calculate the time difference from row 1 to row 3,
or count the number of rows till I reached the threshold

any of above would serve the purpose, best option would be retaining the timestamp.

Next step is to reset the counter (which at the moment stays at 328) and start counting again from row 4; from row 4 to row 7 I accumulate 148+100+100=348, I'd retain the timestamp again (for example). I'd then again, reset the counter and move on again.

I was trying to do this with dplyr or generally in tidyverse however I wasn't able to come up with reasonable solution. I don't think there's a way to do this just solely piping-along-with-dplyr.

I think I can get by with a for-loop but that'd my last option. The difficult part for me is to reset the counter and start counting again.

tmfmnk · Accepted Answer

One dplyr and purrr possibility could be:

df %>%
 group_by(group_id = cumsum(c(0, diff(accumulate(volume, ~ if_else(.x >= 300, .y, .x + .y))) < 0))) %>%
 summarise(timestamp_first = first(datetime),
           timestamp_last = last(datetime),
           time_diff = last(datetime) - first(datetime),
           n_rows = n(),
           volume_sum = sum(volume))

  group_id timestamp_first     timestamp_last      time_diff n_rows volume_sum
                                             
1        0 2020-08-25 09:30:00 2020-08-25 09:30:50 50 secs        3        328
2        1 2020-08-25 09:30:50 2020-08-25 09:30:50  0 secs        3        380
3        2 2020-08-25 09:30:50 2020-08-25 09:31:56 66 secs        3        357
4        3 2020-08-25 09:32:36 2020-08-25 09:32:36  0 secs        2        388
5        4 2020-08-25 09:33:10 2020-08-25 09:33:23 13 secs        2        417
6        5 2020-08-25 09:33:44 2020-08-25 09:33:44  0 secs        2        397
7        6 2020-08-25 09:34:18 2020-08-25 09:34:18  0 secs        1        318
8        7 2020-08-25 09:34:27 2020-08-25 09:34:27  0 secs        3        458
9        8 2020-08-25 09:34:27 2020-08-25 09:34:27  0 secs        1        114

dplyr reset counter when the threshold is reached

Answers (2)

Related Questions