fish_r_user
fish_r_user

Reputation: 61

Cumulative sum of a subset of data based on condition

I have what i think is a simple R task but i'm having trouble. Basically I need to do a cumulative sum of values based on the criteria of another column.

Here's the catch, it should do the cumulative sum for the previous rows until it hits another condition. In the example i'm providing, it accumulates all values from the duration column, 1 and 2 in the condition column. Example is shown below.

duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)

df <- data.frame(duration,condition,accum_sum)
df
row    duration condition accum_sum
1         2         0         0
2         3         1         5
3         2         0         0
4         4         0         0
5         5         0         0
6        10         0         0
7         2         0         0
8         9         2        32
9         7         0         0
10        5         0         0
11        8         0         0
12        9         0         0
13       10         1        39
14       12         0         0
15        4         0         0
16        5         0         0
17        6         2        27

Upvotes: 3

Views: 1387

Answers (4)

Prem
Prem

Reputation: 11975

#cumulative sum
df$cum_sum <- ave(df$duration, c(0, cumsum(df$condition[-nrow(df)])), FUN = cumsum)

#replace all zero condition row with zero value in cumulative sum column
df$cum_sum <- ifelse(df$condition == 0, 0, df$cum_sum)

which gives

   duration condition cum_sum
1         2         0       0
2         3         1       5
3         2         0       0
4         4         0       0
5         5         0       0
6        10         0       0
7         2         0       0
8         9         2      32
9         7         0       0
10        5         0       0
11        8         0       0
12        9         0       0
13       10         1      39
14       12         0       0
15        4         0       0
16        5         0       0
17        6         2      27

Sample data:

df <- structure(list(duration = c(2, 3, 2, 4, 5, 10, 2, 9, 7, 5, 8, 
9, 10, 12, 4, 5, 6), condition = c(0, 1, 0, 0, 0, 0, 0, 2, 0, 
0, 0, 0, 1, 0, 0, 0, 2), cum_sum = c(0, 5, 0, 0, 0, 0, 0, 32, 
0, 0, 0, 0, 39, 0, 0, 0, 27)), .Names = c("duration", "condition", 
"cum_sum"), row.names = c(NA, -17L), class = "data.frame")

Upvotes: 2

r.user.05apr
r.user.05apr

Reputation: 5456

If you shift condition by 1, you can simply use tapply.

duration <- c(2,3,2,4,5,10,2,9,7,5,8,9,10,12,4,5,6)
condition <- c(0,1,0,0,0,0,0,2,0,0,0,0,1,0,0,0,2)
accum_sum <- c(0,5,0,0,0,0,0,32,0,0,0,0,39,0,0,0,27)

df <- data.frame(duration,condition,accum_sum)

df$want <- unlist(tapply(df$duration, 
                         INDEX = cumsum(c(df$condition[1], head(df$condition, -1))), 
                         cumsum)) * ifelse(df$condition == 0, 0, 1)
df

Upvotes: 0

Roland
Roland

Reputation: 132969

Using data.table:

setDT(df)
df[, accum_sum := cumsum(duration), by = rev(cumsum(rev(condition)))]
df[condition == 0, accum_sum := 0]
#    duration condition accum_sum
# 1:        2         0         0
# 2:        3         1         5
# 3:        2         0         0
# 4:        4         0         0
# 5:        5         0         0
# 6:       10         0         0
# 7:        2         0         0
# 8:        9         2        32
# 9:        7         0         0
#10:        5         0         0
#11:        8         0         0
#12:        9         0         0
#13:       10         1        39
#14:       12         0         0
#15:        4         0         0
#16:        5         0         0
#17:        6         2        27

We create runs by filling the zeros backwards with rev(cumsum(rev(condition))) and then group by this "filled" condition.

Upvotes: 4

Marius
Marius

Reputation: 60190

Using dplyr, we can use cumsum() on condition to keep track of how many conditions have been seen. Then add within those subsets:

library(dplyr)

df %>%
    mutate(condition_group = cumsum(lag(condition, default = 0) != 0) + 1) %>%
    group_by(condition_group) %>%
    mutate(accum_sum = ifelse(condition != 0,
                              sum(duration),
                              0))

Output:

# A tibble: 17 x 4
# Groups:   condition_group [4]
   duration condition accum_sum condition_group
      <dbl>     <dbl>     <dbl>           <dbl>
 1        2         0         0               1
 2        3         1         5               1
 3        2         0         0               2
 4        4         0         0               2
 5        5         0         0               2
 6       10         0         0               2
 7        2         0         0               2
 8        9         2        32               2
 9        7         0         0               3
10        5         0         0               3
11        8         0         0               3
12        9         0         0               3
13       10         1        39               3
14       12         0         0               4
15        4         0         0               4
16        5         0         0               4
17        6         2        27               4

Upvotes: 1

Related Questions