JBR

Reputation: 21

Need moving average function to only consider previous observations

I need to create a moving average of a variable that only considers previous observations of that variable, separately for each data group.

I used to use a function, and then hack my variables a bit to make it work. Let me explain below.

I got this function from Stack Overflow:

# Trailing (right-aligned) moving average over the last n observations, including the current one
mav <- function(x, n) if (length(x) >= n) stats::filter(x, rep(1/n, n), sides = 1) else NA_real_

Let's take the example of a moving average over 2 observations:

library(data.table)

test = data.table("values" = c(1,2,3,4,5,6,7,8,9,10,11,12), "category" = c(1,1,1,1,1,1,2,2,2,2,2,2))
test[, ma2 := as.numeric(mav(values, n = 2)), by = category]

This yields:

   values category  ma2
      1        1   NA
      2        1  1.5
      3        1  2.5
      4        1  3.5
      5        1  4.5
      6        1  5.5
      7        2   NA
      8        2  7.5
      9        2  8.5
     10        2  9.5
     11        2 10.5
     12        2 11.5

I want the 3rd observation of ma2 to be the mean of the two previous observations of values, i.e. (1 + 2) / 2 = 1.5. But here, the 3rd observation of ma2 is the mean of the 2nd and 3rd observations, i.e. (2 + 3) / 2 = 2.5.

So I create vprev, another variable that holds, for each observation, the previous value of values:

test[, vprev:= as.numeric(shift(values, 1L, type = "lag" )), by = category]

And then I run the moving average (TRUEma2) on the vprev variable instead:

test[, TRUEma2 := as.numeric(mav(vprev, n = 2)), by = category] 

values category  ma2 vprev TRUEma2
  1        1   NA    NA      NA
  2        1  1.5     1      NA
  3        1  2.5     2     1.5
  4        1  3.5     3     2.5
  5        1  4.5     4     3.5
  6        1  5.5     5     4.5
  7        2   NA    NA      NA
  8        2  7.5     7      NA
  9        2  8.5     8     7.5
 10        2  9.5     9     8.5
 11        2 10.5    10     9.5
 12        2 11.5    11    10.5

That used to work just fine, because my data sets were pretty small. But now I have to do this on multiple data sets of about 2 to 3 million observations each, and I have to create moving averages for about 30 variables in each data set. The process described above takes up to 1 minute 40 seconds per variable, so I calculated that I would need about 25 hours to preprocess all my data sets...

I saw that what takes the most time is the step where I create a new variable holding the previous observation of another variable (approximately 1 minute):

test[, vprev:= as.numeric(shift(values, 1L, type = "lag" )), by = category]

The moving average itself does not take a lot of time to compute.
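
(For reference, the two steps can be timed separately with a sketch like the one below; the data is made up, with sizes only roughly comparable to my real sets.)

library(data.table)

# Made-up data: ~3 million rows split into 1000 categories
big <- data.table(values = rnorm(3e6), category = rep(1:1000, each = 3000))

mav <- function(x, n) if (length(x) >= n) stats::filter(x, rep(1/n, n), sides = 1) else NA_real_

# Step 1: create the lagged column per group (the slow part)
system.time(big[, vprev := as.numeric(shift(values, 1L, type = "lag")), by = category])

# Step 2: the moving average itself (fast)
system.time(big[, TRUEma2 := as.numeric(mav(vprev, n = 2)), by = category])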

I tried to skip this step by putting shift() directly into the moving average line:

test[, TRUEma2 := as.numeric(mav(shift(values, 1L, type = "lag"), n = 2)), by = category]

But it's not faster.

I also tried to modify the moving average function this way:

mav2 <- function(x,n) if(length(x) >= n+1)stats::filter(x-1,rep(1/n,n), sides=1) else NA_real_ 

But then the first value of x can take the value of the observation before it, which is not in the same data group/category.

     values category mav2
      1        1   NA
      2        1  0.5
      3        1  1.5
      4        1  2.5
      5        1  3.5
      6        1  4.5
      7        2   NA
      8        2  6.5
      9        2  7.5
     10        2  8.5
     11        2  9.5
     12        2 10.5

So here is my question: would it be possible to have a moving average function that is as fast as the first one described above, but that only calculates the mean over previous observations?

Thanks a lot for your help :)

EDIT: I tried the solutions proposed by lbusett and Icecreamtoucan, and although they worked on test data, I got the following error message on the real data:

Error in `[.data.table`(toptrain2, , `:=`(paste0("m3_", c("killsM")),  :
  Type of RHS ('double') must match LHS ('logical'). To check and coerce would
  impact performance too much for the fastest cases. Either change the type of
  the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

I was asked to give a sample of the actual data. Here is a dput (just a little extract of my data):

structure(list(killsM = c(4L, 2L, 0L, 3L, 6L, 0L, 1L, 2L, 3L, 5L, 6L, 1L, 4L, 4L, 2L, 6L, 6L, 3L, 1L, 2L), soloKillsM = c(4L, 2L, 0L, 0L, 3L, 0L, 0L, 1L, 1L, 3L, 0L, 0L, 1L, 2L, 0L, 3L, 0L, 1L, 0L, 0L), deathsM = c(3L, 5L, 5L, 1L, 4L, 4L, 3L, 2L, 0L, 4L, 1L, 7L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L), assistsM = c(1L, 1L, 2L, 2L, 7L, 0L, 2L, 2L, 3L, 0L, 4L, 1L, 0L, 1L, 1L, 1L, 4L, 1L, 3L, 3L), killParticipationM = c(0.151515151515152, 0.0909090909090909, 0.125, 0.3125, 0.464285714285714, 0, 0.157894736842105, 0.210526315789474, 0.222222222222222, 0.185185185185185, 0.434782608695652, 0.0869565217391304, 0.2, 0.25, 0.130434782608696, 0.304347826086957, 0.4, 0.16, 0.181818181818182, 0.227272727272727), firstTowerKillM = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA, 20L), class = "data.frame")

To me, it seems that the only difference from the test data is the names of the variables and the values of the observations.
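
(A possible explanation, going by the error text: the m3_ column may already exist as a logical column of NAs, or get created as logical by the first group, and data.table then refuses to assign doubles into it. Coercing the RHS to double, as the message itself suggests, is one way around it. A rough sketch; "group" stands in for the real grouping column, which is not part of the extract above:)

library(data.table)
library(zoo)

setDT(toptrain2)

# Drop the column if it is already there as logical from an earlier attempt
if ("m3_killsM" %in% names(toptrain2)) toptrain2[, m3_killsM := NULL]

# Coerce the RHS to double so the new column is created as numeric in every group;
# 'group' is a placeholder for the real grouping column
toptrain2[, m3_killsM := as.numeric(shift(rollmeanr(killsM, 3, fill = NA))), by = group]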

Upvotes: 1

Views: 880

Answers (3)

B. Christian Kamgang

Reputation: 6489

You can combine the functions shift and rollmeanr from the data.table and zoo packages, respectively, as follows.

library(data.table)
library(zoo)
test = data.table(values = 1:12, category = rep(1:2, each = 6))
test[, mg2 := shift(rollmeanr(values, 2, fill = NA)), category]

    values category  mg2
 1:      1        1   NA
 2:      2        1   NA
 3:      3        1  1.5
 4:      4        1  2.5
 5:      5        1  3.5
 6:      6        1  4.5
 7:      7        2   NA
 8:      8        2   NA
 9:      9        2  7.5
10:     10        2  8.5
11:     11        2  9.5
12:     12        2 10.5

Upvotes: 0

E. Brown

Reputation: 416

I think you could speed this up by putting shift in the function you are using to calculate the average, e.g.

mav_shift <- function(x, n) if (length(x) >= n) stats::filter(shift(x), rep(1/n, n), sides = 1) else NA_real_

By my quick test, this very slightly increases the time to run the function, and removes the step of creating a new variable. Please test to make sure it works as expected, but the results from your sample data appear the same.
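
A quick sanity check against the toy data from the question (sketch; shift() inside mav_shift relies on data.table being loaded):

library(data.table)

test <- data.table(values = c(1:12), category = rep(1:2, each = 6))

# Should reproduce the TRUEma2 column from the question: NA, NA, 1.5, 2.5, ...
test[, check := as.numeric(mav_shift(values, n = 2)), by = category]
test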

EDIT and quicker solution:

mav_shift <- function(x, n) {
  if (length(x) >= n) {
    stats::filter(shift(x), rep(1/n, n), sides = 1)
  } else NA_real_
}

result <- by(test$values, test$category, mav_shift, n = 2, simplify = TRUE)
test$new <- as.vector(unlist(result))

Upvotes: 0

lbusett

Reputation: 5932

What about shifting the results instead of the input values? Something like this (using rollmean from package zoo):

library(data.table)
library(zoo)
test = data.table("values" = c(1,2,3,4,5,6,7,8, 9,10,11,12), 
                  "category" = c(1,1,1,1,1,1,2,2,2,2,2,2))
test[, paste0("ravg_", c("values")) := shift(lapply(
  .SD, rollmean, k = 2, na.pad = TRUE, align = "right"), 1), 
  .SDcols = c("values"), by = category]

    values category ravg_values
 1:      1        1          NA
 2:      2        1          NA
 3:      3        1         1.5
 4:      4        1         2.5
 5:      5        1         3.5
 6:      6        1         4.5
 7:      7        2          NA
 8:      8        2          NA
 9:      9        2         7.5
10:     10        2         8.5
11:     11        2         9.5
12:     12        2        10.5

You can also adapt this easily to multiple columns (see https://stackoverflow.com/a/31482551/6871135).
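
For instance, a multi-column version could look roughly like this, where cols stands for the set of columns to average (only "values" exists in the toy data; in the real data it would be the ~30 variable names such as killsM, deathsM, ...):

cols <- c("values")  # replace with the real column names
test[, paste0("ravg_", cols) := shift(lapply(
  .SD, rollmean, k = 2, na.pad = TRUE, align = "right"), 1),
  .SDcols = cols, by = category]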

Upvotes: 1
