Reputation: 1177
My problem is quite simple but I still cannot find an easy solution that doesn't require me to create a lot of unnecessary filler rows.
Given this dataset:
df <- structure(
list(
date = c(
2015.16666666667,
2015.33333333333,
2015.83333333333,
2016,
2016.08333333333,
2016.25,
2016.33333333333,
2016.41666666667,
2016.5,
2016.66666666667
),
Age = c(
1,
2.99999999999818,
8.99999999999818,
10.9999999999991,
11.9999999999982,
13.9999999999991,
14.9999999999982,
16,
16.9999999999991,
19
),
year = c(
2015L,
2015L,
2015L,
2015L,
2016L,
2016L,
2016L,
2016L,
2016L,
2016L
),
month = c(2L, 4L,
10L, 12L, 1L, 3L, 4L, 5L, 6L, 8L),
r_Total = c(
481.02,
666.36,
851.7,
1633.74,
2155.1,
2613.74,
3105.44,
4429.52,
5170.88,
5170.88
)
),
row.names = c(NA,-10L),
class = c("tbl_df", "tbl", "data.frame")
)
I want to compute a moving average of r_total
for the last 12 months. However the data has no rows for months where r_total was 0 so my usual solution does not work:
library(dplyr)
df %>%
mutate(cummulative_sum = cumsum(r_total), moving_average = (cummulative_sum-lag(cummulative_sum,12) )/12)
This computes a moving average for the last 12 values but crucially not the last 12 months!
lag()
which I use here just looks in the ordered vector and identifies the value from the -12th position. However I need a function that gives me the value from when Age == Age-12
(Age being months since inception of the value).
So what can I do?
Upvotes: 0
Views: 51
Reputation: 5766
I too have been struggling with some of these moving-windows issues. And in cases where the step size or window size is not a constant number of rows, tidyverse-approaches can become cumbersome.
In these cases, although I sometimes try to hard to fit it into a piped fashion, ordinary loops can be easier to work with.
First attempt with a loop. Instead of thinking the main input is the data.frame, our main input is actually the dates at which we want to look back the 12 months (which so happens to be derived from df
, but you could instead choose to go by a calendar year or whatnot).
Remember, when using loops to build a result-set, use lists or pre-allocated result-vectors instead of growing a data.frame or vector by repeatedly appending.
df$yearmonth <- with(df, year + (month-1)/12)
df$cummulative <- NA_real_
for (i in seq_along(df$Age)) {
df$cummulative[i] <- df %>%
filter(between(Age, df$Age[i]-11, df$Age[i])) %>% ## any rows within the past 12 months
pull(r_Total) %>% sum
}
> df
# A tibble: 10 x 6
date Age year month r_Total cummulative
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 2015. 1 2015 2 481. 481.
2 2015. 3.00 2015 4 666. 1147.
3 2016. 9.00 2015 10 852. 1999.
4 2016 11.0 2015 12 1634. 3633.
5 2016. 12.0 2016 1 2155. 5788.
6 2016. 14.0 2016 3 2614. 7254.
7 2016. 15.0 2016 4 3105. 10360.
8 2016. 16 2016 5 4430. 14789.
9 2016. 17.0 2016 6 5171. 19960.
10 2017. 19 2016 8 5171. 25131
It is however unclear what you are averaging over; is the cummulative r_total
always divided by 12 months, even when it's sum is only 2 rows?
Upvotes: 0
Reputation: 66510
The slider
package is great for when you need to use another column to define the time window.
library(slider)
df %>%
mutate(avg_12mo = slide_index_dbl(r_Total, Age, mean, .before = 11),
sum_12mo = slide_index_dbl(r_Total, Age, sum, .before = 11))
# A tibble: 10 x 7
date Age year month r_Total avg_12mo sum_12mo
<dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 2015. 1 2015 2 481. 481. 481.
2 2015. 3.00 2015 4 666. 574. 1147.
3 2016. 9.00 2015 10 852. 666. 1999.
4 2016 11.0 2015 12 1634. 908. 3633.
5 2016. 12.0 2016 1 2155. 1158. 5788.
6 2016. 14.0 2016 3 2614. 1814. 7254.
7 2016. 15.0 2016 4 3105. 2072. 10360.
8 2016. 16 2016 5 4430. 2465. 14789.
9 2016. 17.0 2016 6 5171. 2851. 19960.
10 2017. 19 2016 8 5171. 3141. 25131
Upvotes: 1