Reputation: 495

Using rollmean to calculate a moving average excluding the first observation in R

I'm currently working on translating some commands for time-series data in Stata into R. I'm using the zoo package to calculate moving averages in R. Here is what my data looks like:

data <- cbind(c(1960:1970), c(95.5, 95.3, 95.3, 95.7, 95.7, 95.7, 
95.1, 95.1, 95.1, 95, 95))

      [,1] [,2]
 [1,] 1960 95.5
 [2,] 1961 95.3
 [3,] 1962 95.3
 [4,] 1963 95.7
 [5,] 1964 95.7
 [6,] 1965 95.7
 [7,] 1966 95.1
 [8,] 1967 95.1
 [9,] 1968 95.1
[10,] 1969 95.0
[11,] 1970 95.0

I'll make this into a data.frame:

data <- as.data.frame(data)

Now, I can use the rollmean function to calculate the moving averages for turnout with my data:

library(zoo)
data$turnout <- rollmean(data[,2], 1, fill = NA)

And this is what I get:

     V1   V2 turnout
1  1960 95.5 95.5
2  1961 95.3 95.3
3  1962 95.3 95.3
4  1963 95.7 95.7
5  1964 95.7 95.7
6  1965 95.7 95.7
7  1966 95.1 95.1
8  1967 95.1 95.1
9  1968 95.1 95.1
10 1969 95.0 95.0
11 1970 95.0 95.0

This is all well and good, but my issue is that I want my column turnout (moving average) to start at 1961 instead of 1960. This code does not exclude the first observation, which is what I am trying to do.

For reference, the equivalent Stata command would be:

tssmooth ma m1turnout = turnout, window (1 0)

I have already tried using the align = "right" argument, but that doesn't seem to do the trick. Any ideas?

Thanks in advance!

Edit: to clarify, I'm doing this across different window lengths. In Stata the full code is as follows, where since is a variable that records the number of years since an intervention.

foreach y of numlist 1(1)10{
    tssmooth ma m`y'turnout = turnout, window (`y' 0)
}
gen dvturnout=.
foreach y of numlist 2(1)9{
    replace dvturnout = l1.turnout if since==1
    replace dvturnout = m`y'turnout if since==`y' & m`y'turnout!=.
    replace dvturnout = m10turnout if (since==10 & m10turnout!=.) | (since==. & redist!=. & m10turnout!=.)
}
foreach y of numlist 1(1)10{
    drop m`y'turnout
}

My ultimate goal is this dvturnout variable.

The first loop of the Stata code is:

foreach y of numlist 1(1)10{
    tssmooth ma m`y'turnout = turnout, window (`y' 0)
}

To reproduce it in R, I do this (the new variables are added starting after column 35):

# i is the index into the data_countries list, set by an enclosing loop that is not shown
for (j in 1:10) {
  data_countries[[i]][,35+j] <- rollmean(data_countries[[i]][,13], j, fill = NA, align = "right")
}

And it spits out this for me:

   year since  V36   V37      V38    V39   V40      V41      V42     V43      V44   V45
1  1960    NA 95.5    NA       NA     NA    NA       NA       NA      NA       NA    NA
2  1961    NA 95.3 95.40       NA     NA    NA       NA       NA      NA       NA    NA
3  1962    NA 95.3 95.30 95.36667     NA    NA       NA       NA      NA       NA    NA
4  1963    NA 95.7 95.50 95.43333 95.450    NA       NA       NA      NA       NA    NA
5  1964    NA 95.7 95.70 95.56667 95.500 95.50       NA       NA      NA       NA    NA
6  1965    NA 95.7 95.70 95.70000 95.600 95.54 95.53333       NA      NA       NA    NA
7  1966    NA 95.1 95.40 95.50000 95.550 95.50 95.46667 95.47143      NA       NA    NA
8  1967    NA 95.1 95.10 95.30000 95.400 95.46 95.43333 95.41428 95.4250       NA    NA
9  1968    NA 95.1 95.10 95.10000 95.250 95.34 95.40000 95.38571 95.3750 95.38889    NA
10 1969    NA 95.0 95.05 95.06667 95.075 95.20 95.28333 95.34286 95.3375 95.33333 95.35
11 1970    NA 95.0 95.00 95.03333 95.050 95.06 95.16667 95.24286 95.3000 95.30000 95.30

These numbers are all fine, but they're "shifted" down from where I want them to be. Here is what the same operation gives me in Stata:

year  dvturnout  m1turnout  m2turnout  m3turnout  m4turnout  m5turnout  m6turnout  m7turnout  m8turnout  m9turnout  m10turnout
1960
1961             95.5       95.5       95.5       95.5       95.5       95.5       95.5       95.5       95.5       95.5
1962             95.3       95.4       95.4       95.4       95.4       95.4       95.4       95.4       95.4       95.4
1963             95.3       95.3       95.36667   95.36667   95.36667   95.36667   95.36667   95.36667   95.36667   95.36667
1964             95.7       95.5       95.43333   95.45      95.45      95.45      95.45      95.45      95.45      95.45
1965             95.7       95.7       95.56667   95.5       95.5       95.5       95.5       95.5       95.5       95.5
1966             95.7       95.7       95.7       95.6       95.54      95.53333   95.53333   95.53333   95.53333   95.53333
1967             95.1       95.39999   95.5       95.55      95.5       95.46667   95.47143   95.47143   95.47143   95.47143
1968             95.1       95.1       95.3       95.39999   95.46      95.43333   95.41428   95.425     95.425     95.425
1969             95.1       95.1       95.1       95.25      95.34      95.39999   95.38571   95.375     95.38889   95.38889
1970             95         95.05      95.06667   95.075     95.2       95.28333   95.34286   95.3375    95.33334   95.35

Upvotes: 3

Answers (3)

Roman

Reputation: 4989

What you need is a moving average function that does not include the current observation. Thankfully, w_i_l_l wrote a function that does exactly that. What complicates things is that the author of your paper filled in any moving average that lacks enough data (e.g., k = 4 but only 3 data points) with the result from the previous column, i.e., the next-shorter window. I would advise against doing that, as it can (and usually will) lead to major confusion unless it is pointed out very explicitly.
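To make the idea concrete, here is a minimal base R sketch of an average over the previous k values that shrinks its window when fewer than k earlier values exist. The shrinking rule is an assumption inferred from the Stata output shown in the question, and lagged_ma is a made-up helper name, not a function from zoo or Stata.

```r
# Hypothetical helper: mean of up to k values strictly before each position
lagged_ma <- function(x, k) {
  vapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)       # nothing precedes the first observation
    mean(x[max(1, i - k):(i - 1)])     # use a shorter window near the start
  }, numeric(1))
}

turnout <- c(95.5, 95.3, 95.3, 95.7, 95.7, 95.7, 95.1, 95.1, 95.1, 95, 95)
m3 <- lagged_ma(turnout, 3)
print(round(m3, 5))
```

For k = 3 the first four values should come out as NA, 95.5, 95.4 and 95.36667, which lines up with the m3turnout column in the Stata output for 1960 to 1963.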

Code

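# w_i_l_l's moving average function: trailing mean of the last n values, current one included
mav <- function(x, n) {
  stats::filter(x, rep(1/n, n), sides = 1)  # stats:: avoids masking by dplyr::filter
}
# Mean of the n values before the current one: take the trailing mean
# over n + 1 values, scale it back to a sum, and remove the current value
mavback <- function(x, n) {
  a <- mav(x, 1)        # the current value itself
  b <- mav(x, n + 1)    # trailing mean over n + 1 values, current included
  c <- (1/n) * ((n + 1) * b - a)
  return(c)
}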

# Create 10 columns with moving averages of k = 1:10
result <- NULL
for(i in 1:10){
    result <- cbind(result,mavback(test[,2], i))
}

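# Give proper names to the columns. Note that 1:ncol(result)-1 evaluates as
# (1:10) - 1, so the names run from m0turnout to m9turnout; m0turnout is the
# window of one lagged value (Stata's m1turnout), and so on
colnames(result) <- paste0("m", 1:ncol(result)-1,"turnout")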

# Combine result with base data
result <- cbind(test,data.frame(result))

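# WONKY STATISTICS: If there is an NA (= not enough data for a
# moving average), fill it with the previous column's result.
# Columns 1-3 are year, turnout and the first moving average, so start at 4
for(i in 4:ncol(result)){
    # Inner loop starts from the second row; row 1 has no earlier data and stays NA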
    for(j in 2:nrow(result)){
        # Check for NA
        if(is.na(result[j,i])){
            result[j,i] <- result[j,i-1]
        }
    }
}

Result

> result
   year turnout m0turnout m1turnout m2turnout m3turnout m4turnout m5turnout m6turnout m7turnout m8turnout m9turnout
1  1960    95.5        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA
2  1961    95.3      95.5     95.50  95.50000  95.50000  95.50000  95.50000  95.50000  95.50000  95.50000  95.50000
3  1962    95.3      95.3     95.40  95.40000  95.40000  95.40000  95.40000  95.40000  95.40000  95.40000  95.40000
4  1963    95.7      95.3     95.30  95.36667  95.36667  95.36667  95.36667  95.36667  95.36667  95.36667  95.36667
5  1964    95.7      95.7     95.50  95.43333  95.45000  95.45000  95.45000  95.45000  95.45000  95.45000  95.45000
6  1965    95.7      95.7     95.70  95.56667  95.50000  95.50000  95.50000  95.50000  95.50000  95.50000  95.50000
7  1966    95.1      95.7     95.70  95.70000  95.60000  95.54000  95.53333  95.53333  95.53333  95.53333  95.53333
8  1967    95.1      95.1     95.40  95.50000  95.55000  95.50000  95.46667  95.47143  95.47143  95.47143  95.47143
9  1968    95.1      95.1     95.10  95.30000  95.40000  95.46000  95.43333  95.41429  95.42500  95.42500  95.42500
10 1969    95.0      95.1     95.10  95.10000  95.25000  95.34000  95.40000  95.38571  95.37500  95.38889  95.38889
11 1970    95.0      95.0     95.05  95.06667  95.07500  95.20000  95.28333  95.34286  95.33750  95.33333  95.35000

Result without the "filling up"

> result
   year turnout m0turnout m1turnout m2turnout m3turnout m4turnout m5turnout m6turnout m7turnout m8turnout m9turnout
1  1960    95.5        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA
2  1961    95.3      95.5        NA        NA        NA        NA        NA        NA        NA        NA        NA
3  1962    95.3      95.3     95.40        NA        NA        NA        NA        NA        NA        NA        NA
4  1963    95.7      95.3     95.30  95.36667        NA        NA        NA        NA        NA        NA        NA
5  1964    95.7      95.7     95.50  95.43333    95.450        NA        NA        NA        NA        NA        NA
6  1965    95.7      95.7     95.70  95.56667    95.500     95.50        NA        NA        NA        NA        NA
7  1966    95.1      95.7     95.70  95.70000    95.600     95.54  95.53333        NA        NA        NA        NA
8  1967    95.1      95.1     95.40  95.50000    95.550     95.50  95.46667  95.47143        NA        NA        NA
9  1968    95.1      95.1     95.10  95.30000    95.400     95.46  95.43333  95.41429   95.4250        NA        NA
10 1969    95.0      95.1     95.10  95.10000    95.250     95.34  95.40000  95.38571   95.3750  95.38889        NA
11 1970    95.0      95.0     95.05  95.06667    95.075     95.20  95.28333  95.34286   95.3375  95.33333     95.35

Data

test <- data.frame(cbind(year = c(1960:1970), 
                         turnout = c(95.5, 95.3, 95.3, 95.7, 95.7,
                                     95.7, 95.1, 95.1, 95.1, 95, 95)))
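
The question's end goal, dvturnout, picks the window length row by row from since. A rough base R sketch of that selection follows, under loud assumptions: the since values below are invented for illustration, windows shrink near the start of the series as in the Stata output, and the extra Stata rule for missing since with non-missing redist is left out.

```r
turnout <- c(95.5, 95.3, 95.3, 95.7, 95.7, 95.7, 95.1, 95.1, 95.1, 95, 95)

# Hypothetical values: years since an intervention (NA = not applicable)
since <- c(NA, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# For each row, average the 'since' values that precede it
dvturnout <- vapply(seq_along(turnout), function(i) {
  k <- since[i]
  if (is.na(k) || i == 1) return(NA_real_)   # no window length or no earlier data
  mean(turnout[max(1, i - k):(i - 1)])       # up to k earlier values
}, numeric(1))

print(data.frame(since, turnout, dvturnout))
```

Computing the average directly per row avoids creating and then dropping ten intermediate columns, at the cost of recomputing each window.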

Upvotes: 4

Julian

Reputation: 495

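I found the simplest way to do this was with the lag function from dplyr. Note that base R's stats::lag only shifts the time index of a ts object and leaves the values of a plain vector where they are, so the dplyr version is needed here.

data$turnout <- dplyr::lag(rollmean(data[,2], 1, fill = NA), 1)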
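
A one-step shift can also be sketched in base R without extra packages, assuming the goal is simply to move each value down one row and leave the first row missing. The variable names here are illustrative only.

```r
turnout <- c(95.5, 95.3, 95.3, 95.7, 95.7, 95.7, 95.1, 95.1, 95.1, 95, 95)

# Drop the last value and pad the front with NA, so row i holds the value from row i - 1
lagged <- c(NA, head(turnout, -1))

print(data.frame(turnout, lagged))
```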

Upvotes: 1

Samuel

Reputation: 3053

Maybe you are looking for something like this:

library(zoo)
library(forecast)
data <- cbind(c(1960:1970), c(95.5, 95.3, 95.3, 95.7, 95.7, 95.7, 95.1, 95.1, 95.1, 95, 95)) 
x1 <- ts(data = data[, 2], start = 1960, end = 1970, frequency = 1)
x2 <- cbind(x1, turnout = zoo::rollmeanr(x1, k = 2))

Print the time series object:

x2
Time Series:
Start = 1960 
End = 1970 
Frequency = 1 
       x1 turnout
1960 95.5      NA
1961 95.3   95.40
1962 95.3   95.30
1963 95.7   95.50
1964 95.7   95.70
1965 95.7   95.70
1966 95.1   95.40
1967 95.1   95.10
1968 95.1   95.10
1969 95.0   95.05
1970 95.0   95.00

Plot:

forecast::autoplot(x2)