zhang
zhang

Reputation: 543

Adding a cumulative sum column in a data.table with a minimum threshold

I have a data.table in R with a column called "score" containing both positive and negative real numbers. I want to add a new column called "sum" which represents the cumulative sum of the "score" column. However, I need the cumulative sum to have a minimum threshold. If the sum at any point falls below the minimum threshold, it should be set to the minimum value, and the next step of the cumulative sum should start from that minimum value.

I have tried using the cumsum() function to calculate the cumulative sum, but I'm not sure how to incorporate the minimum threshold logic. Here's my code so far:

dt <- data.table(score = c(10, -13, 8, -12, 6))

dt[, sum := cumsum(score)]

# its wrong
print(dt)

   score sum
1:    10  10
2:   -13  -3
3:     8   5
4:   -12  -7
5:     6  -1

I also tried

min_value <- -3
dt[, sum := {
  temp_sum <- cumsum(score)
  temp_sum[temp_sum < min_value] <- min_value
  temp_sum
}]

# its also wrong
print(dt)
   score sum
1:    10  10
2:   -13  -3
3:     8   5
4:   -12  -3
5:     6  -1 # expected sum is 3 here

I also have tried this from Cumulative sum with a threshold window in R data.table and dplyr / R cumulative sum with reset

sum_score <- function(thresh) {
  function(x) {
    purrr::accumulate(x, ~dplyr::if_else(.x<=thresh, .y, .x+.y))
  }  
}

dt[, sum := sum_score(-3)(score)]

print(dt)
   score sum
1:    10  10
2:   -13  -3
3:     8   8
4:   -12  -4
5:     6   6

Expected output:

   score sum
1:    10  10
2:   -13  -3
3:     8   5
4:   -12  -3
5:     6  3

Upvotes: 0

Views: 112

Answers (2)

Julian
Julian

Reputation: 9240

Here is a dplyr option:

library(dplyr)
library(purrr)
library(tidyr)
dt <- data.table::data.table(score = c(10, -13, 8, -12, 6, -20, 40))

dt |> 
  mutate(
    ID = row_number(),
    raw_score = score,
    indicator = (ifelse(cumsum(score) < -3, row_number(), NA))) |> 
  fill(indicator) |> 
  group_split(indicator) |> 
  map_dfr(~.x |> 
        mutate(score = ifelse(
          !is.na(indicator) & row_number() == 1, -3, score
        ))) |> 
  arrange(ID) |> 
  group_by(indicator) |> 
  mutate(sum = cumsum(score)) |> 
  ungroup() |> 
  select(raw_score, sum)

Output:

  raw_score   sum
      <dbl> <dbl>
1        10    10
2       -13    -3
3         8     5
4       -12    -3
5         6     3
6       -20    -3
7        40    37

Upvotes: 1

Meisam
Meisam

Reputation: 631

I understand you preferably want a data.table solution but here is a simple dplyr method, which first generated the cumsum then alters its values accordingly:

library(dplyr)

dt <- data.frame(score = c(10, -13, 8, -12, 6))

thresh = -3

dt %>% mutate(sum = cumsum(score)) %>% 
  mutate(sum = case_when(
    lag(sum) < thresh ~ score + thresh,
    sum < thresh ~ thresh,
    is.na(sum) ~ score,
    .default = sum)
    )

Output:

  score   sum
1    10    10
2   -13    -3
3     8     5
4   -12    -3
5     6     3

Upvotes: 1

Related Questions