Mean of 50 most recent entries in R

Question

I have a dataframe showing a date, an item and a value, and I want to add a column showing the average of its 50 previous entries (or NA if it hasn't had 50) e.g. the table could be

      data
date     item value  
01/01/01 a    2  
01/01/01 b    1.5  
04/01/01 c    1.7  
05/01/01 a    1.9  
......

and part of it could become

date     item value last_50_mean   
........ 
11/09/01 a    1.2   1.1638
12/09/01 b    1.9   1.5843 
12/09/01 a    1.4   1.1621
13/09/01 c    0.9   NA
........

So in this case the mean of a in the 50 entries before 11/09/01 is 1.1638 and c hasn't had 50 entries before 13/09/01 so returns NA

I am currently doing this using the following function

  data[, 'last_50_mean'] <- sapply(1:nrow(data), function(i){
        prevDates <- data[data$date < data$date[i] & data$item == data$item[i], ]
        num       <- nrow(prevGames)
        if(nGames >= 50){
          round(mean(prevDates[(num- 49):num, ]$value), 4)
        }
      }
  )

But my dataframe is large and it is taking a long time (in fact I'm not 100% sure it works as it is still running... Does anyone know of the best way to do this?

Martin Morgan · Accepted Answer

The mean of N observations can be calculated from the cumulative sum and the difference between the first and last value, diff(cumsum(x), lag=N - 1). Your question wants the first N - 1 values to be padded, so

meanN <- function(x, N=50)
    ## mean of last N observations, padded in front with NA
{
    x0 <- x[seq_len(length(x) - N + 1)]
    x1 <- (x0 + diff(cumsum(x), lag=N-1)) / N
    c(rep(NA, N - 1), x1)
}

You'd like to do this for several groups. For a data.frame like

df <- data.frame(item=sample(letters[1:3], 1000, TRUE),
                 value=runif(1000, 1, 3),
                 last_50_mean=NA)

one way of doing this is

split(df$last_50_mean, df$item) <- lapply(split(df$value, df$item), meanN)

leading to for instance

> tail(df)
     item    value last_50_mean
995     c 1.191486     2.037707
996     c 2.899214     2.073022
997     c 2.019375     2.054914
998     c 2.737043     2.066389
999     a 1.703752     1.923234
1000    c 1.602442     2.043517

This assumes that your data frame is ordered by time. A potential problem is when long vectors overflow cumsum; one could address this by centering value so the expectation is that cumsum doesn't stray too far from zero. A recent question addressed alternatives to split<- and dropping the last N observations.

Mean of 50 most recent entries in R

Answers (1)

Related Questions