guzbrush
guzbrush

Reputation: 123

R: Use rollapply on list column

I have a data frame df with a dates column and a values column and want to calculate the 5% quantile of values for the past n days for every date in dates. The problem is that the dates do not appear uniquely and in different (random) cardinalities. For example

library(lubridate)
library(tidyverse)
library(zoo)

n <- 3

dates_v <- seq(as_date("2018-09-01"), as_date("2018-09-14"), by = "days") 

df <- data.frame(dates = rep(dates_v,c(3, 2, 1, 4, 1, 5, 1, 3, 3, 2, 5, 3, 4, 3)), 
                 values = rep(seq(1,5),8))

I can write a for loop that solves this problem, but this is quite slow:

df2 <- list()

for (k in dates_v[n:length(dates_v)]) {
  k <- as_date(k)

  df2 <- c(df2,
  df %>%
    filter(dates >= k %m-% days(n-1) & dates <= k) %>%
    mutate(dates = k) %>%
    group_by(dates) %>%
    summarise(values = quantile(values, 0.05)) %>%
    list())
}

df2 <- df2 %>%
  bind_rows() 

I tried the zoo package, but the rollapply functions do not seem to be applicable here due to the varying window sizes. One idea I had was to transform the values column into a nested column using purrr::nest and then use rollapply to roll-concatenate the entries of the nested column

df2 <- df %>%
  group_by(dates) %>%
  nest() %>%
  mutate(data = map(data, unlist))

df2$data <- rollapply(df2$data, width = n, c, align = "right")

df2 %>% 
  mutate(data = map(data, ~quantile(., 0.05)))

but that did not work out. Is there something I am doing wrong, or is rollapply simply not working with list columns?

Edit:

A more realistic example for my use case is a data frame of the form

dates_v <- seq(as_date("2018-01-01"), as_date("2018-09-14"), by = "days") 

df <- data.frame(dates = rep(dates_v,sample(seq(9000,11000), length(dates_v), replace = TRUE))) %>%
  mutate(values = rnorm(length(dates)))

Upvotes: 0

Views: 567

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269481

rollapply can be used with varying widths by specifying a vector of widths, w, one per element. r gives the quantiles for all rows from the first dates-2 row to the current row and the last line of code drops rows having dates which are not the last occurrence of that date and also drops the value column.

w <- seq_along(df$dates) - match(df$dates - 2, df$dates, nomatch = 0)
r <- transform(df, `5%` = rollapplyr(values, w, quantile, 0.05), 
  check.names = FALSE)

r[!duplicated(df$dates, fromLast = TRUE), -2]

giving:

        dates   5%
3  2018-09-01 1.10
5  2018-09-02 1.20
6  2018-09-03 1.20
10 2018-09-04 1.25
11 2018-09-05 1.20
16 2018-09-06 1.00
17 2018-09-07 1.25
20 2018-09-08 1.35
23 2018-09-09 1.25
25 2018-09-10 1.30
30 2018-09-11 1.40
33 2018-09-12 1.00
37 2018-09-13 1.00
40 2018-09-14 1.40

or with pipes and using w from above:

df %>%
   mutate(`5%` = rollapplyr(.$values, w, quantile, 0.05)) %>%
   filter(!duplicated(.$dates, fromLast = TRUE)) %>%
   select(-values)

Upvotes: 2

phiver
phiver

Reputation: 23598

Instead of a loop you can use sapply like this:

n <- 3
sapply(unique(df$dates), function(x){
                        quantile(df$values[df$dates >= x - (n-1) & df$dates <= x], 0.05)
                      })
  5%   5%   5%   5%   5%   5%   5%   5%   5%   5%   5%   5%   5%   5% 
1.10 1.20 1.00 1.30 1.00 1.00 1.00 1.40 1.30 1.35 1.00 1.00 1.00 1.00 

To get it into a data.frame you could do this:

outcome <- data.frame(dates = unique(df$dates),
                      quantiles = sapply(unique(df$dates), function(x){
                        quantile(df$values[df$dates >= x - (n-1) & df$dates <= x], 0.05)
                      })
                      )
        dates quantiles
1  2018-09-01      1.10
2  2018-09-02      1.20
3  2018-09-03      1.00
4  2018-09-04      1.30
5  2018-09-05      1.00
6  2018-09-06      1.00
7  2018-09-07      1.00
8  2018-09-08      1.40
9  2018-09-09      1.30
10 2018-09-10      1.35
11 2018-09-11      1.00
12 2018-09-12      1.00
13 2018-09-13      1.00
14 2018-09-14      1.00

Upvotes: 2

Related Questions