Reputation: 475
I want to calculate a sum over a sliding window on grouped data.
As I would like to stick to official functions if possible I started with rollapplyr like this:
library(tidyverse)
library(reshape2)
library(zoo)
data = data.frame(Count=seq(1,10,1),
group=c("A","B","A","A","B","B","B","B","A","A"))
window_size <- 3
data_rolling <- data %>%
arrange(group) %>%
group_by(group) %>%
mutate(Rolling_Count = rollapplyr(Count, width=window_size, FUN=sum, fill = NA)) %>%
ungroup()
for the first entries which are smaller than width (in this case 3) it gets filled with NA as defined, but I would actually like to have the sum of the possible data there like this:
Count group Rolling_Count expected_Result
1 A NA 1
3 A NA 4
4 A 8 8
9 A 16 16
10 A 23 23
2 B NA 2
5 B NA 7
6 B 13 13
7 B 18 18
8 B 21 21
I know that I can replace the width=window_size
with something like this:
c(rep(1:window_size,1),rep(window_size:window_size,(n()-window_size)))
to get what I want but this is really slow. In addition this approach would assume that n() is greater than window_size.
So: Is there already an R/zoo function which can handle grouped data like above and in addition data with less than window_size entries and is faster to the above approach?
Thanks for any hints!
Upvotes: 4
Views: 872
Reputation: 269481
Here are a few different approaches. (2), (4) and (5) assume that Count has no NA's (which is the case in the question's example data).
1) rollapplyr/partial Use partial=TRUE
in rollapplyr
:
data %>%
arrange(group) %>%
mutate(Rolling_Count =
rollapplyr(Count, window_size, sum, fill = NA, partial = TRUE),
.by = group)
2) rollsumr/coalesce or use coalesce
with rollsumr
data %>%
arrange(group) %>%
mutate(Rolling_Count = coalesce(
rollsumr(Count, window_size, sum, fill = NA),
cumsum(Count)), .by = group)
2a) rollsumr prepending zeros Here we prepend window_size-1 zeros or
data %>%
arrange(group) %>%
mutate(Rolling_Count =
rollsumr(c(numeric(window_size-1), Count), window_size, sum),
.by = group)
3) sql Another possibility is to use SQL
library(sqldf)
library(magrittr)
data %>% {
fn$sqldf("select
*,
sum(Count) over
(partition by [group]
order by rowid
rows between `window_size` - 1 preceding and current row) as Rolling_Count
from [.]"
) }
4) diff/cumsum We can also use diff/cumsum
to get the result using only dplyr
library(dplyr)
data %>%
arrange(group) %>%
mutate(Rolling_Count = window_size %>%
numeric %>%
c(Count) %>%
cumsum %>%
diff(window_size), .by = group)
5) Base R This uses only base R. It is similar to (4).
o <- order(data$group)
transform(data[o, ], Rolling_Count = ave(Count, group, FUN = \(x)
window_size |>
numeric() |>
c(x) |>
cumsum() |>
diff(window_size)))
Upvotes: 0
Reputation: 5137
simply use the partial argument that rollapplyr provides for this purpose
rollapplyr(Count,
width=window_size,
FUN=sum,
fill = NA,
partial=TRUE)
Upvotes: 0
Reputation: 1253
If you're willing to utilise data.table
and collapse
we can achieve optimal performance by pre-computing the grouping structure, sorting the data by it if it already isn't, calculating a vector of window widths and passing that to data.table::frollsum()
.
I have an implementation in my package: https://github.com/NicChr/timeplyr
window_size <- 3
data[, Rolling_Count := timeplyr::roll_sum(Count, g = group, window = window_size)][]
Count group Group_RowNumber Rolling_Count
1: 1 A 1 1
2: 3 A 2 4
3: 4 A 3 8
4: 9 A 4 16
5: 10 A 5 23
6: 2 B 1 2
7: 5 B 2 7
8: 6 B 3 13
9: 7 B 4 18
10: 8 B 5 21
# Groups
groups <- collapse::GRP(data, by = "group")
# Group sizes
group_sizes <- collapse::GRPN(groups, expand = F)
# Window vector
window_vec <- pmin(sequence(group_sizes), window_size)
# Sort data by groups
setkey(data, group)
# Rolling sum by group
data[, Rolling_Count := data.table::frollsum(Count,
n = window_vec,
align = "right", adaptive = TRUE)]
# Sort back
data[collapse::greorder(seq_len(.N), g = groups)]
Count group Rolling_Count
1: 1 A 1
2: 3 A 4
3: 4 A 8
4: 9 A 16
5: 10 A 23
6: 2 B 2
7: 5 B 7
8: 6 B 13
9: 7 B 18
10: 8 B 21
library(bench)
df <- data.table(g = sample.int(10^4, 10^5, TRUE),
x = rnorm(10^5))
mark(e1 = df[, sum := frollsum(x, n = 7, align = "right", na.rm = FALSE), by = "g"]$sum,
e2 = df[, sum := timeplyr::roll_sum(x, window = 7, g = get("g"), partial = FALSE, na.rm = FALSE)]$sum)
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:tm> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list>
1 e1 622.72ms 622.7ms 1.61 158.57MB 1.61 1 1 623ms <dbl> <Rprofmem> <bench_tm>
2 e2 8.09ms 11.1ms 89.6 4.02MB 0 45 0 502ms <dbl> <Rprofmem> <bench_tm>
# i 1 more variable: gc <list>
Upvotes: 0
Reputation: 16697
Here is another solutions, which is a little more base-R-ish and still should not lag behind in performance. It might be actually faster as it lacks all the features that rolling functions adds. We could replace shift
function from data.table
with base-R operation, then should be the fastest you can get in base R.
Note that this function will fail badly if some NAs will be present in input, also is more likely to suffer from floating point rounding error.
data = data.frame(Count=seq(1,10,1),
group=c("A","B","A","A","B","B","B","B","A","A"))
window_size = 3
library(data.table)
setDT(data)
# base R fast rolling sum
bRfrs = function(x, n) {
cumx = cumsum(x)
cumx - shift(cumx, n, fill=0)
}
data[, .(Count, Rolling_Count=bRfrs(Count, window_size)), group]
# group Count Rolling_Count
# 1: A 1 1
# 2: A 3 4
# 3: A 4 8
# 4: A 9 16
# 5: A 10 23
# 6: B 2 2
# 7: B 5 7
# 8: B 6 13
# 9: B 7 18
#10: B 8 21
In 1.12.4 release of data.table we are planning to add frollsum
function already, then it will be another high performance option to achieve what you are looking for.
Upvotes: 1
Reputation: 4242
A solution based on data.table
and RcppRoll
that should be much more performant.
It's not as clean as I would like -- there's actually a partial
argument in RcppRoll::roll_sum()
that hasn't been implemented yet that would theoretically solve this cleanly, but it doesn't seem like that will be worked anytime soon-- see GH Issue #18 .
At any rate, until someone implements a rolling sum in R that allows what you need here, adding in a cumsum
on the first n - 1
rows seems to be a sensible solution.
library(data.table)
library(RcppRoll)
data = data.frame(Count=seq(1,10,1),
group=c("A","B","A","A","B","B","B","B","A","A"))
## Convert to a `data.table` by reference
setDT(data)
window_size <- 3
## Add a counter row so that we can go back and fill in rows
## 1 & 2 of each group
data[,Group_RowNumber := seq_len(.N), keyby = .(group)]
## Do a rolling window -- this won't fill in the first 2 rows
data[,Rolling_Count := RcppRoll::roll_sum(Count,
n = window_size,
align = "right",
fill = NA), keyby = .(group)]
## Go back and fill in the ones we missed
data[Group_RowNumber < window_size, Rolling_Count := cumsum(Count), by = .(group)]
data
# Count group Group_RowNumber Rolling_Count
# 1: 1 A 1 1
# 2: 3 A 2 4
# 3: 4 A 3 8
# 4: 9 A 4 16
# 5: 10 A 5 23
# 6: 2 B 1 2
# 7: 5 B 2 7
# 8: 6 B 3 13
# 9: 7 B 4 18
# 10: 8 B 5 21
Upvotes: 3