adl
adl

Reputation: 1441

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.

I have this:

df <- data.frame(
  a = rep(1:2, each = 5),
  b = 1:10,
  step_window = c(2,3,1,2,4, 1,2,3,2,1)
)

I'm trying to sum up the values of b, within the groups a. The trick is, I want the sum of b values that corresponds to the number of rows following the current row given by step_window.

This is the output I'm looking for:

data.frame(
    a = rep(1:2, each = 5),
    step_window = c(2,3,1,2,4, 
                    1,2,3,2,1),
    b = 1:10,
    sum_b_step_window = c(3, 9, 3, 9, 5,
                          6, 15, 27, 19, 10)
  ) 

I tried to do this using the RcppRoll but I get an error Expecting a single value:

df %>% 
  group_by(a) %>% 
  mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))

Upvotes: 0

Views: 218

Answers (4)

G. Grothendieck
G. Grothendieck

Reputation: 270448

1) rollapply

rollapply in zoo supports vector widths. partial=TRUE says that if the width goes past the end then use just the values within the data. (Another possibility would be to use fill=NA instead in which case it would fill with NA's if there were not enough data left) . align="left" specifies that the current value at each step is the left end of the range to sum.

library(dplyr)
library(zoo)

df %>%
  group_by(a) %>%
  mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
  ungroup

2) SQL

This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.

library(sqldf)

sqldf("select A.*, sum(B.b) as sum
  from df A 
  left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
    and A.a = B.a
  group by A.rowid")

Upvotes: 2

pseudospin
pseudospin

Reputation: 2777

data.table solution using cumulative sums

setDT(df)
df[, sum_b_step_window := {
  cs <- c(0,cumsum(b))
  cs[pmin(.N+1, 1:.N+step_window)]-cs[pmax(1, (1:.N))]
},by = a]

Upvotes: 1

Edo
Edo

Reputation: 7858

Here is a solution with the package slider.

library(dplyr)
library(slider)

df %>%
    group_by(a) %>% 
    mutate(sum_b_step_window = hop_vec(b, row_number(), step_window+row_number()-1, sum)) %>% 
    ungroup() 

It is flexible on different window sizes.

Output:

# A tibble: 10 x 4
       a     b step_window sum_b_step_window
   <int> <int>       <dbl>             <int>
 1     1     1           2                 3
 2     1     2           3                 9
 3     1     3           1                 3
 4     1     4           2                 9
 5     1     5           4                 5
 6     2     6           1                 6
 7     2     7           2                15
 8     2     8           3                27
 9     2     9           2                19
10     2    10           1                10

slider is a couple-of-months-old tidyverse package specific for sliding window functions. Have a look here for more info: page, vignette

hop is the engine of slider. With this solution we are triggering different .start and .stop to sum the values of b according to the a groups.

With _vec you're asking hop to return a vector: a double in this case.

row_number() is a dplyr function that allows you to return the row number of each group, thus allowing you to slide along the rows.

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389355

I'm not sure if having variable window size is possible in any of the rolling function. Here is one way to do this using map2_dbl :

library(dplyr)
df %>% 
  group_by(a) %>% 
  mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window, 
                             ~sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))

#      a     b step_window sum_b_step_window
#   <int> <int>       <dbl>             <dbl>
# 1     1     1           2                 3
# 2     1     2           3                 9
# 3     1     3           1                 3
# 4     1     4           2                 9
# 5     1     5           4                 5
# 6     2     6           1                 6
# 7     2     7           2                15
# 8     2     8           3                27
# 9     2     9           2                19
#10     2    10           1                10

Upvotes: 2

Related Questions