Reputation: 123
I have a data frame df
with a dates
column and a values
column and want to calculate the 5% quantile of values
for the past n
days for every date in dates
. The problem is that the dates do not appear uniquely and in different (random) cardinalities. For example
library(lubridate)
library(tidyverse)
library(zoo)
n <- 3
dates_v <- seq(as_date("2018-09-01"), as_date("2018-09-14"), by = "days")
df <- data.frame(dates = rep(dates_v,c(3, 2, 1, 4, 1, 5, 1, 3, 3, 2, 5, 3, 4, 3)),
values = rep(seq(1,5),8))
I can write a for loop that solves this problem, but this is quite slow:
df2 <- list()
for (k in dates_v[n:length(dates_v)]) {
k <- as_date(k)
df2 <- c(df2,
df %>%
filter(dates >= k %m-% days(n-1) & dates <= k) %>%
mutate(dates = k) %>%
group_by(dates) %>%
summarise(values = quantile(values, 0.05)) %>%
list())
}
df2 <- df2 %>%
bind_rows()
I tried the zoo
package, but the rollapply
functions do not seem to be applicable here due to the varying window sizes. One idea I had was to transform the values
column into a nested column using purrr::nest
and then use rollapply
to roll-concatenate the entries of the nested column
df2 <- df %>%
group_by(dates) %>%
nest() %>%
mutate(data = map(data, unlist))
df2$data <- rollapply(df2$data, width = n, c, align = "right")
df2 %>%
mutate(data = map(data, ~quantile(., 0.05)))
but that did not work out. Is there something I am doing wrong, or is rollapply
simply not working with list columns?
Edit:
A more realistic example for my use case is a data frame of the form
dates_v <- seq(as_date("2018-01-01"), as_date("2018-09-14"), by = "days")
df <- data.frame(dates = rep(dates_v,sample(seq(9000,11000), length(dates_v), replace = TRUE))) %>%
mutate(values = rnorm(length(dates)))
Upvotes: 0
Views: 567
Reputation: 269481
rollapply
can be used with varying widths by specifying a vector of widths, w
, one per element. r
gives the quantiles for all rows from the first dates-2 row to the current row and the last line of code drops rows having dates which are not the last occurrence of that date and also drops the value
column.
w <- seq_along(df$dates) - match(df$dates - 2, df$dates, nomatch = 0)
r <- transform(df, `5%` = rollapplyr(values, w, quantile, 0.05),
check.names = FALSE)
r[!duplicated(df$dates, fromLast = TRUE), -2]
giving:
dates 5%
3 2018-09-01 1.10
5 2018-09-02 1.20
6 2018-09-03 1.20
10 2018-09-04 1.25
11 2018-09-05 1.20
16 2018-09-06 1.00
17 2018-09-07 1.25
20 2018-09-08 1.35
23 2018-09-09 1.25
25 2018-09-10 1.30
30 2018-09-11 1.40
33 2018-09-12 1.00
37 2018-09-13 1.00
40 2018-09-14 1.40
or with pipes and using w
from above:
df %>%
mutate(`5%` = rollapplyr(.$values, w, quantile, 0.05)) %>%
filter(!duplicated(.$dates, fromLast = TRUE)) %>%
select(-values)
Upvotes: 2
Reputation: 23598
Instead of a loop you can use sapply
like this:
n <- 3
sapply(unique(df$dates), function(x){
quantile(df$values[df$dates >= x - (n-1) & df$dates <= x], 0.05)
})
5% 5% 5% 5% 5% 5% 5% 5% 5% 5% 5% 5% 5% 5%
1.10 1.20 1.00 1.30 1.00 1.00 1.00 1.40 1.30 1.35 1.00 1.00 1.00 1.00
To get it into a data.frame you could do this:
outcome <- data.frame(dates = unique(df$dates),
quantiles = sapply(unique(df$dates), function(x){
quantile(df$values[df$dates >= x - (n-1) & df$dates <= x], 0.05)
})
)
dates quantiles
1 2018-09-01 1.10
2 2018-09-02 1.20
3 2018-09-03 1.00
4 2018-09-04 1.30
5 2018-09-05 1.00
6 2018-09-06 1.00
7 2018-09-07 1.00
8 2018-09-08 1.40
9 2018-09-09 1.30
10 2018-09-10 1.35
11 2018-09-11 1.00
12 2018-09-12 1.00
13 2018-09-13 1.00
14 2018-09-14 1.00
Upvotes: 2