Reputation: 61

Cumulative sum of a subset of data based on condition

I have what i think is a simple R task but i'm having trouble. Basically I need to do a cumulative sum of values based on the criteria of another column.

Here's the catch, it should do the cumulative sum for the previous rows until it hits another condition. In the example i'm providing, it accumulates all values from the duration column, 1 and 2 in the condition column. Example is shown below.

duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)

df <- data.frame(duration,condition,accum_sum)
df
row    duration condition accum_sum
1         2         0         0
2         3         1         5
3         2         0         0
4         4         0         0
5         5         0         0
6        10         0         0
7         2         0         0
8         9         2        32
9         7         0         0
10        5         0         0
11        8         0         0
12        9         0         0
13       10         1        39
14       12         0         0
15        4         0         0
16        5         0         0
17        6         2        27

Upvotes: 3

Answers (4)

Prem

Reputation: 11975

#cumulative sum
df$cum_sum <- ave(df$duration, c(0, cumsum(df$condition[-nrow(df)])), FUN = cumsum)

#replace all zero condition row with zero value in cumulative sum column
df$cum_sum <- ifelse(df$condition == 0, 0, df$cum_sum)

which gives

   duration condition cum_sum
1         2         0       0
2         3         1       5
3         2         0       0
4         4         0       0
5         5         0       0
6        10         0       0
7         2         0       0
8         9         2      32
9         7         0       0
10        5         0       0
11        8         0       0
12        9         0       0
13       10         1      39
14       12         0       0
15        4         0       0
16        5         0       0
17        6         2      27

Sample data:

df <- structure(list(duration = c(2, 3, 2, 4, 5, 10, 2, 9, 7, 5, 8, 
9, 10, 12, 4, 5, 6), condition = c(0, 1, 0, 0, 0, 0, 0, 2, 0, 
0, 0, 0, 1, 0, 0, 0, 2), cum_sum = c(0, 5, 0, 0, 0, 0, 0, 32, 
0, 0, 0, 0, 39, 0, 0, 0, 27)), .Names = c("duration", "condition", 
"cum_sum"), row.names = c(NA, -17L), class = "data.frame")

Upvotes: 2

r.user.05apr

Reputation: 5456

If you shift condition by 1, you can simply use tapply.

duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)

df <- data.frame(duration,condition,accum_sum)

df$want <- unlist(tapply(df$duration, 
                         INDEX = cumsum(c(df$condition[1], head(df$condition, -1))), 
                         cumsum)) * ifelse(df$condition == 0, 0, 1)
df

Upvotes: 0

Roland

Reputation: 132969

Using data.table:

setDT(df)
df[, accum_sum := cumsum(duration), by = rev(cumsum(rev(condition)))]
df[condition == 0, accum_sum := 0]
#    duration condition accum_sum
# 1:        2         0         0
# 2:        3         1         5
# 3:        2         0         0
# 4:        4         0         0
# 5:        5         0         0
# 6:       10         0         0
# 7:        2         0         0
# 8:        9         2        32
# 9:        7         0         0
#10:        5         0         0
#11:        8         0         0
#12:        9         0         0
#13:       10         1        39
#14:       12         0         0
#15:        4         0         0
#16:        5         0         0
#17:        6         2        27

We create runs by filling the zeros backwards with rev(cumsum(rev(condition))) and then group by this "filled" condition.

Upvotes: 4

Marius

Reputation: 60190

Using dplyr, we can use cumsum() on condition to keep track of how many conditions have been seen. Then add within those subsets:

library(dplyr)

df %>%
    mutate(condition_group = cumsum(lag(condition, default = 0) != 0) + 1) %>%
    group_by(condition_group) %>%
    mutate(accum_sum = ifelse(condition != 0,
                              sum(duration),
                              0))

Output:

# A tibble: 17 x 4
# Groups:   condition_group [4]
   duration condition accum_sum condition_group
      <dbl>     <dbl>     <dbl>           <dbl>
 1        2         0         0               1
 2        3         1         5               1
 3        2         0         0               2
 4        4         0         0               2
 5        5         0         0               2
 6       10         0         0               2
 7        2         0         0               2
 8        9         2        32               2
 9        7         0         0               3
10        5         0         0               3
11        8         0         0               3
12        9         0         0               3
13       10         1        39               3
14       12         0         0               4
15        4         0         0               4
16        5         0         0               4
17        6         2        27               4

Upvotes: 1

Cumulative sum of a subset of data based on condition

Answers (4)

Related Questions