Stephan Claus
Stephan Claus

Reputation: 475

Is there a fast R function like rollapplyr with increasing window size?

I want to calculate a sum over a sliding window on grouped data.

As I would like to stick to official functions if possible I started with rollapplyr like this:

library(tidyverse)
library(reshape2)
library(zoo)  

data = data.frame(Count=seq(1,10,1),
                  group=c("A","B","A","A","B","B","B","B","A","A"))


window_size <- 3    

data_rolling <- data %>%
  arrange(group) %>%
  group_by(group) %>%
  mutate(Rolling_Count = rollapplyr(Count, width=window_size, FUN=sum, fill = NA)) %>%
  ungroup()

for the first entries which are smaller than width (in this case 3) it gets filled with NA as defined, but I would actually like to have the sum of the possible data there like this:

 Count group Rolling_Count expected_Result
 1     A            NA    1
 3     A            NA    4
 4     A            8     8
 9     A            16    16
10     A            23    23
 2     B            NA    2
 5     B            NA    7
 6     B            13    13
 7     B            18    18
 8     B            21    21

I know that I can replace the width=window_size with something like this:

c(rep(1:window_size,1),rep(window_size:window_size,(n()-window_size)))

to get what I want but this is really slow. In addition this approach would assume that n() is greater than window_size.

So: Is there already an R/zoo function which can handle grouped data like above and in addition data with less than window_size entries and is faster to the above approach?

Thanks for any hints!

Upvotes: 4

Views: 872

Answers (5)

G. Grothendieck
G. Grothendieck

Reputation: 269481

Here are a few different approaches. (2), (4) and (5) assume that Count has no NA's (which is the case in the question's example data).

1) rollapplyr/partial Use partial=TRUE in rollapplyr:

data %>%
  arrange(group) %>%
  mutate(Rolling_Count = 
    rollapplyr(Count, window_size, sum, fill = NA, partial = TRUE),
    .by = group)

2) rollsumr/coalesce or use coalesce with rollsumr

data %>%
  arrange(group) %>%
  mutate(Rolling_Count = coalesce(
    rollsumr(Count, window_size, sum, fill = NA), 
    cumsum(Count)), .by = group)

2a) rollsumr prepending zeros Here we prepend window_size-1 zeros or

data %>%
  arrange(group) %>%
  mutate(Rolling_Count = 
       rollsumr(c(numeric(window_size-1), Count), window_size, sum), 
    .by = group)

3) sql Another possibility is to use SQL

library(sqldf)
library(magrittr)

data %>% {
  fn$sqldf("select
    *, 
    sum(Count) over 
      (partition by [group]
      order by rowid
      rows between `window_size` - 1 preceding and current row) as Rolling_Count
    from [.]"
) }

4) diff/cumsum We can also use diff/cumsum to get the result using only dplyr

library(dplyr)

data %>%
  arrange(group) %>%
  mutate(Rolling_Count = window_size %>%
                         numeric %>%
                         c(Count) %>%
                         cumsum %>%
                         diff(window_size), .by = group)

5) Base R This uses only base R. It is similar to (4).

o <- order(data$group)
transform(data[o, ], Rolling_Count = ave(Count, group, FUN = \(x)
  window_size |>
  numeric() |>
  c(x) |>
  cumsum() |>
  diff(window_size)))

Upvotes: 0

Nir Graham
Nir Graham

Reputation: 5137

simply use the partial argument that rollapplyr provides for this purpose

rollapplyr(Count,
 width=window_size,
 FUN=sum,
 fill = NA,
partial=TRUE)

Upvotes: 0

NicChr
NicChr

Reputation: 1253

If you're willing to utilise data.table and collapse we can achieve optimal performance by pre-computing the grouping structure, sorting the data by it if it already isn't, calculating a vector of window widths and passing that to data.table::frollsum().

I have an implementation in my package: https://github.com/NicChr/timeplyr

window_size <- 3
data[, Rolling_Count := timeplyr::roll_sum(Count, g = group, window = window_size)][]
    Count group Group_RowNumber Rolling_Count
 1:     1     A               1             1
 2:     3     A               2             4
 3:     4     A               3             8
 4:     9     A               4            16
 5:    10     A               5            23
 6:     2     B               1             2
 7:     5     B               2             7
 8:     6     B               3            13
 9:     7     B               4            18
10:     8     B               5            21

Doing the same as above manually

# Groups
groups <- collapse::GRP(data, by = "group")
# Group sizes
group_sizes <- collapse::GRPN(groups, expand = F)

# Window vector
window_vec <- pmin(sequence(group_sizes), window_size)

# Sort data by groups
setkey(data, group)
# Rolling sum by group
data[, Rolling_Count := data.table::frollsum(Count, 
                                             n = window_vec,
                                             align = "right", adaptive = TRUE)]
# Sort back
data[collapse::greorder(seq_len(.N), g = groups)]
    Count group Rolling_Count
 1:     1     A             1
 2:     3     A             4
 3:     4     A             8
 4:     9     A            16
 5:    10     A            23
 6:     2     B             2
 7:     5     B             7
 8:     6     B            13
 9:     7     B            18
10:     8     B            21

Benchmark for lots of groups


library(bench)
df <- data.table(g = sample.int(10^4, 10^5, TRUE),
                 x = rnorm(10^5))
mark(e1 = df[, sum := frollsum(x, n = 7, align = "right", na.rm = FALSE), by = "g"]$sum,
     e2 = df[, sum := timeplyr::roll_sum(x, window = 7, g = get("g"), partial = FALSE, na.rm = FALSE)]$sum)
# A tibble: 2 x 13
  expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time      
  <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>    
1 e1         622.72ms 622.7ms      1.61  158.57MB     1.61     1     1      623ms <dbl>  <Rprofmem> <bench_tm>
2 e2           8.09ms  11.1ms     89.6     4.02MB     0       45     0      502ms <dbl>  <Rprofmem> <bench_tm>
# i 1 more variable: gc <list>

Upvotes: 0

jangorecki
jangorecki

Reputation: 16697

Here is another solutions, which is a little more base-R-ish and still should not lag behind in performance. It might be actually faster as it lacks all the features that rolling functions adds. We could replace shift function from data.table with base-R operation, then should be the fastest you can get in base R.
Note that this function will fail badly if some NAs will be present in input, also is more likely to suffer from floating point rounding error.

data = data.frame(Count=seq(1,10,1),
                  group=c("A","B","A","A","B","B","B","B","A","A"))
window_size = 3

library(data.table)
setDT(data)
# base R fast rolling sum
bRfrs = function(x, n) {
  cumx = cumsum(x)
  cumx - shift(cumx, n, fill=0)
}
data[, .(Count, Rolling_Count=bRfrs(Count, window_size)), group]
#    group Count Rolling_Count
# 1:     A     1             1
# 2:     A     3             4
# 3:     A     4             8
# 4:     A     9            16
# 5:     A    10            23
# 6:     B     2             2
# 7:     B     5             7
# 8:     B     6            13
# 9:     B     7            18
#10:     B     8            21

In 1.12.4 release of data.table we are planning to add frollsum function already, then it will be another high performance option to achieve what you are looking for.

Upvotes: 1

Matt Summersgill
Matt Summersgill

Reputation: 4242

A solution based on data.table and RcppRoll that should be much more performant.

It's not as clean as I would like -- there's actually a partial argument in RcppRoll::roll_sum() that hasn't been implemented yet that would theoretically solve this cleanly, but it doesn't seem like that will be worked anytime soon-- see GH Issue #18 .

At any rate, until someone implements a rolling sum in R that allows what you need here, adding in a cumsum on the first n - 1 rows seems to be a sensible solution.

library(data.table)
library(RcppRoll)

data = data.frame(Count=seq(1,10,1),
                  group=c("A","B","A","A","B","B","B","B","A","A"))

## Convert to a `data.table` by reference
setDT(data)
window_size <- 3    

## Add a counter row so that we can go back and fill in rows
## 1 & 2 of each group
data[,Group_RowNumber := seq_len(.N), keyby = .(group)]

## Do a rolling window -- this won't fill in the first 2 rows
data[,Rolling_Count := RcppRoll::roll_sum(Count,
                                          n = window_size,
                                          align = "right",
                                          fill = NA), keyby = .(group)]

## Go back and fill in the ones we missed
data[Group_RowNumber < window_size, Rolling_Count := cumsum(Count), by = .(group)]

data

#     Count group Group_RowNumber Rolling_Count
#  1:     1     A               1             1
#  2:     3     A               2             4
#  3:     4     A               3             8
#  4:     9     A               4            16
#  5:    10     A               5            23
#  6:     2     B               1             2
#  7:     5     B               2             7
#  8:     6     B               3            13
#  9:     7     B               4            18
# 10:     8     B               5            21

Upvotes: 3

Related Questions