Reputation: 6649
I have a dataframe showing a date, an item and a value, and I want to add a column showing the average of its 50 previous entries (or NA if it hasn't had 50) e.g. the table could be
data
date item value
01/01/01 a 2
01/01/01 b 1.5
04/01/01 c 1.7
05/01/01 a 1.9
......
and part of it could become
date item value last_50_mean
........
11/09/01 a 1.2 1.1638
12/09/01 b 1.9 1.5843
12/09/01 a 1.4 1.1621
13/09/01 c 0.9 NA
........
So in this case the mean of a in the 50 entries before 11/09/01 is 1.1638 and c hasn't had 50 entries before 13/09/01 so returns NA
I am currently doing this using the following function
data[, 'last_50_mean'] <- sapply(1:nrow(data), function(i){
prevDates <- data[data$date < data$date[i] & data$item == data$item[i], ]
num <- nrow(prevGames)
if(nGames >= 50){
round(mean(prevDates[(num- 49):num, ]$value), 4)
}
}
)
But my dataframe is large and it is taking a long time (in fact I'm not 100% sure it works as it is still running... Does anyone know of the best way to do this?
Upvotes: 2
Views: 179
Reputation: 46866
The mean of N observations can be calculated from the cumulative sum and the difference between the first and last value, diff(cumsum(x), lag=N - 1)
. Your question wants the first N - 1 values to be padded, so
meanN <- function(x, N=50)
## mean of last N observations, padded in front with NA
{
x0 <- x[seq_len(length(x) - N + 1)]
x1 <- (x0 + diff(cumsum(x), lag=N-1)) / N
c(rep(NA, N - 1), x1)
}
You'd like to do this for several groups. For a data.frame
like
df <- data.frame(item=sample(letters[1:3], 1000, TRUE),
value=runif(1000, 1, 3),
last_50_mean=NA)
one way of doing this is
split(df$last_50_mean, df$item) <- lapply(split(df$value, df$item), meanN)
leading to for instance
> tail(df)
item value last_50_mean
995 c 1.191486 2.037707
996 c 2.899214 2.073022
997 c 2.019375 2.054914
998 c 2.737043 2.066389
999 a 1.703752 1.923234
1000 c 1.602442 2.043517
This assumes that your data frame is ordered by time. A potential problem is when long vectors overflow cumsum
; one could address this by centering value
so the expectation is that cumsum
doesn't stray too far from zero. A recent question addressed alternatives to split<-
and dropping the last N observations.
Upvotes: 4