Reputation: 3805
library(data.table)
set.seed(123)
d <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5))
For each ID, I want to do a 7-years rolling average of y
starting from 1998 onwards. However, the condition is that in each rolling window,
I only select the top 5 highest value of y to do the average. For e.g.
first rolling window would be
1998-2004 - only do the average of top 5 highest 'y' values
1999-2005 - only do the average of top 5 highest 'y' values . .
2007-2013 - only do the average of top 5 highest 'y' values
2008-2014 - only do the average of top 5 highest 'y' values
I am interested in using data.table to achieve this. However also open to other suggestions. Here's what I tried
d = setDT(d)
d[, avg.Y := frollmean(y, 7), by = ID]
How do I enter another argument where for each rolling 7-years window I only select the top 5 highest y value to calculate the mean?
I could also have a case that some IDs might not have minimum 7 years of data to do a moving average in which case the above function will give me NAs. For those IDs, is it possible to simply take an arithematic mean? For e.g. if a ID has data from 1998-2002, in such cases, can I simply take the average of y
from 1998-2002
Upvotes: 3
Views: 573
Reputation: 5798
A few more steps and a little bit repetitive base R solution:
df$seven_year_group <- paste0(ave(as.integer(as.factor(df$yearRef)) %% 7,
as.integer(as.factor(df$yearRef)) %% 7,
FUN = seq.int),
"_",
df$ID)
seven_year_averages <- data.frame(avg_y = do.call("rbind", lapply(split(df, df$seven_year_group),
function(x){mean(tail(x[order(x$y), "y"], 5))})))
seven_year_averages$seven_year_group <- row.names(seven_year_averages)
df <- merge(df, seven_year_averages, by = "seven_year_group", all.x = TRUE)
Data:
set.seed(2019)
df <- data.frame(ID = rep(1:5, each = 17), yearRef = rep(1998:2014, times = 5), y = sample(1:100, 17 * 5))
Upvotes: 1
Reputation: 389275
We can use rollapplyr
from zoo
and apply a custom function to calculate mean
of top 5 values in each rolling window.
library(data.table)
library(zoo)
setDT(d)
d[, avg.Y:= rollapplyr(y, 7,function(x) mean(tail(sort(x), 5)), fill = NA), by = ID]
For cases where there could be less number of observations than the window size we can do
d[, avg.Y:= if (.N > 6)
rollapplyr(y, 7,function(x) mean(tail(sort(x), 5)), fill = NA)
else mean(y), by = ID]
Upvotes: 2
Reputation: 33613
First time using frollapply()
but this seems to work:
get_mean_top5 <- function(x) mean(-sort(-x, partial = 1:5)[1:5])
d[, test := frollapply(y, 7, FUN = get_mean_top5), by = ID]
The function get_mean_top5()
filters out the top 5 highest values and then takes the mean. Other more readable forms would be:
get_mean_top5 <- function(x) mean(mean(x[order(x, decreasing=TRUE)[1:5]]))
Upvotes: 2