89_Simple
89_Simple

Reputation: 3805

conditional rolling average

library(data.table)

set.seed(123)
d <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5)) 

For each ID, I want to do a 7-years rolling average of y starting from 1998 onwards. However, the condition is that in each rolling window, I only select the top 5 highest value of y to do the average. For e.g.

first rolling window would be

1998-2004 - only do the average of top 5 highest 'y' values

1999-2005 - only do the average of top 5 highest 'y' values . .

2007-2013 - only do the average of top 5 highest 'y' values

2008-2014 - only do the average of top 5 highest 'y' values

I am interested in using data.table to achieve this. However also open to other suggestions. Here's what I tried

 d = setDT(d)
 d[, avg.Y := frollmean(y, 7), by = ID]

How do I enter another argument where for each rolling 7-years window I only select the top 5 highest y value to calculate the mean?

EDIT

I could also have a case that some IDs might not have minimum 7 years of data to do a moving average in which case the above function will give me NAs. For those IDs, is it possible to simply take an arithematic mean? For e.g. if a ID has data from 1998-2002, in such cases, can I simply take the average of y from 1998-2002

Upvotes: 3

Views: 573

Answers (3)

hello_friend
hello_friend

Reputation: 5798

A few more steps and a little bit repetitive base R solution:

df$seven_year_group <-  paste0(ave(as.integer(as.factor(df$yearRef)) %% 7,

                               as.integer(as.factor(df$yearRef)) %% 7,

                               FUN = seq.int), 

                           "_",

                           df$ID)

seven_year_averages <- data.frame(avg_y = do.call("rbind", lapply(split(df, df$seven_year_group),

                                             function(x){mean(tail(x[order(x$y), "y"], 5))})))



seven_year_averages$seven_year_group <- row.names(seven_year_averages)

df <- merge(df, seven_year_averages, by = "seven_year_group", all.x = TRUE)

Data:

set.seed(2019)

df <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5))

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389275

We can use rollapplyr from zoo and apply a custom function to calculate mean of top 5 values in each rolling window.

library(data.table)
library(zoo)

setDT(d)
d[, avg.Y:= rollapplyr(y, 7,function(x) mean(tail(sort(x), 5)), fill = NA), by = ID]

For cases where there could be less number of observations than the window size we can do

d[, avg.Y:= if (.N > 6) 
            rollapplyr(y, 7,function(x) mean(tail(sort(x), 5)), fill = NA)  
            else mean(y), by = ID]

Upvotes: 2

s_baldur
s_baldur

Reputation: 33613

First time using frollapply() but this seems to work:

get_mean_top5 <- function(x) mean(-sort(-x, partial = 1:5)[1:5])
d[, test := frollapply(y, 7, FUN = get_mean_top5), by = ID]

The function get_mean_top5() filters out the top 5 highest values and then takes the mean. Other more readable forms would be:

get_mean_top5 <- function(x) mean(mean(x[order(x, decreasing=TRUE)[1:5]]))

Upvotes: 2

Related Questions