Jim

Reputation: 51

I am confused by the R implementation of lag in regression analysis

Look at this linear regression: Y ~ X + lag(X, 1). The meaning is very clear: it is a linear regression, and lag(X, 1) means the first lag of X. What confuses me is the R implementation of the lag function. In R, lag(X, 1) moves X to the prior time, for example:

> library(zoo)
> x <- c(11, 12, 13, 14)
> str(zoo(x))
‘zoo’ series from 1 to 4 
Data: num [1:4] 11 12 13 14
Index:int [1:4] 1 2 3 4
> lag(zoo(x))
1  2  3
12 13 14

When you regress, which value does R actually use at time 2? I guess R uses the data like this:

time   1   2   3   4
Y      (anything)
X     11  12  13  14
lagX  12  13  14

But this is nonsense! At time 2 (or any specific time) we are supposed to use the first lag of X and the current X, that is 11 and 12, not 13 and 12 as above. The first lag of X should be the prior X, shouldn't it? I am so confused! Please explain this to me; thanks a lot.

Upvotes: 1

Views: 6136

Answers (2)

G. Grothendieck

Reputation: 269694

The question starts out with:

Look at this linear regression: Y ~ X + lag(X, 1). The meaning is very clear: it is a linear regression, and lag(X, 1) means the first lag of X.

Actually that is not the case. It does not refer to this model:

Y[i] = a + b * X[i] + c * X[i-1] + error[i]

It actually refers to this model:

Y[i] = a + b * X[i] + c * X[i+1] + error[i]

which is not likely what you intended.

It is likely that you wanted lag(X, -1) rather than lag(X, 1). In R, lagging a series by k shifts its time base back by k, so the lagged series starts earlier. At any fixed time, the lagged series therefore holds the value that the original series has k steps later.

The other item to be careful of is that lm does not align series. It knows nothing about the time index. You will need to align the series yourself or use a package which does it for you.

More on these points below.

ts

First let us consider lag.ts from the core of R, since lag.zoo and lag.zooreg are based on it and consistent with it. lag.ts lags the times of the series so that the lagged series starts earlier. That is, if we have a series whose values are 11, 12, 13 and 14 at times 1, 2, 3 and 4 respectively, lag.ts shifts each time back by one so that the lagged series has the same values 11, 12, 13 and 14 but at times 0, 1, 2 and 3. The original series started at 1 but the lagged series starts at 0. Originally the value 12 was at time 2, but in the lagged series the value 13 is at time 2. In code, we have:

tt <- ts(11:14)
cbind(tt, lag(tt), lag(tt, 1), lag(tt, -1))

gives:

Time Series:
Start = 0 
End = 5 
Frequency = 1 
  tt lag(tt) lag(tt, 1) lag(tt, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14
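As a small check of the table above, the following base R sketch shows that lag.ts leaves the data values alone and only changes the time attributes reported by tsp. The variable names (lead1, lag1) are purely illustrative, and the sketch assumes that lag has not been masked by another package.

```r
tt <- ts(11:14)

lead1 <- lag(tt, 1)    # time base shifted back by one: values now sit one step earlier
lag1  <- lag(tt, -1)   # time base shifted forward by one: the "previous value" series

tsp(tt)      # 1 4 1  (start, end, frequency)
tsp(lead1)   # 0 3 1
tsp(lag1)    # 2 5 1

# The underlying values are untouched; only the times differ.
identical(as.numeric(tt), as.numeric(lead1))   # TRUE

# At time 2 the lag(tt, -1) series holds the value tt had at time 1.
as.numeric(window(lag1, start = 2, end = 2))   # 11
```

Because only the time attributes change, any function that ignores them sees the lagged series as a plain copy of the original.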

zoo

lag.zoo is consistent with lag.ts. Note that since zoo represents irregularly spaced series, it cannot assume that time 0 comes before time 1. We could only make such an assumption if we knew the series were regularly spaced. Thus, if time 1 is the earliest time in a series, the value at this time is dropped since there is no way to determine what earlier time to lag it to. The lagged series now starts at the second time value of the original series. This is similar to the lag.ts example, except that in the lag.ts example there was a time 0 and here there is no such time. Similarly, we cannot extend the time scale forward in time either.

library(zoo)
z <- zoo(11:14)
merge(z, lag(z), lag(z, 1), lag(z,-1))

giving:

   z lag(z) lag(z, 1) lag(z, -1)
1 11     12        12         NA
2 12     13        13         11
3 13     14        14         12
4 14     NA        NA         13

zooreg

The zoo package does have a zooreg class which assumes regularly spaced series except for some missing values and it can deduce what comes before just as ts can. With zooreg it can deduce that time 0 comes before and time 5 comes after.

library(zoo)
zr <- zooreg(11:14)
merge(zr, lag(zr), lag(zr, 1), lag(zr,-1))

giving:

  zr lag(zr) lag(zr, 1) lag(zr, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14

lm

lm does not know anything about zoo and ignores the time index entirely. If you want the series to be aligned by time before the regression is run, use the dyn (or dynlm) package. Using the former:

library(dyn)
set.seed(123)
zr <- zooreg(rnorm(10))
y <- 1 + 2 * zr + 3 * lag(zr, -1)
dyn$lm(y ~ zr + lag(zr, -1))

giving:

Call:
lm(formula = dyn(y ~ zr + lag(zr, -1)))

Coefficients:
(Intercept)           zr  lag(zr, -1)  
          1            2            3  
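For comparison, here is a minimal base R sketch of what happens without alignment, and of one way to align by hand. It uses ts and ts.intersect instead of dyn, which is a substitution made for illustration and not the approach used above; the names y0, naive, d and x_lag1 are made up for the example, and lag is assumed not to be masked by another package.

```r
set.seed(123)
x <- ts(rnorm(10))
y <- 1 + 2 * x + 3 * lag(x, -1)   # ts arithmetic keeps only the overlapping times 2..10

# Naive fit: lm() ignores the time index, so x and lag(x, -1) enter the
# model matrix as two identical columns and the second one is aliased.
y0 <- rnorm(10)                   # arbitrary response with the same length as x
naive <- lm(y0 ~ x + lag(x, -1))
coef(naive)                       # the coefficient on lag(x, -1) is NA

# Manual alignment: ts.intersect() keeps only the times common to all series.
d <- ts.intersect(y = y, x = x, x_lag1 = lag(x, -1), dframe = TRUE)
fit <- lm(y ~ x + x_lag1, data = d)
coef(fit)                         # approximately 1, 2 and 3
```

The aligned data frame has nine rows (times 2 to 10), which is why the second fit can recover the coefficients used to construct y.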

Note 1: Be sure to read the documentation in the help files: ?lag.ts , ?lag.zoo , ?lag.zooreg and help(package = dyn)

Note 2: If the direction of the lag seems confusing you could define your own function and use that in place of lag. For example, this gives the same coefficients as the lm output shown above:

Lag <- function(x, k = 1) lag(x, -k)
dyn$lm(y ~ zr + Lag(zr))

An additional word of warning: unlike lag.zoo and lag.zooreg, which are consistent with the core of R, lag.xts from the xts package is inconsistent with it. The lag in dplyr is inconsistent as well and, to make things worse, if you load dplyr it will mask lag in R with its own inconsistent version. Also note that L in dynlm works the same as Lag but wisely uses a different name to avoid confusion.
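To make the two conventions concrete, here is a small sketch in base R only. prev_value is a hypothetical helper written for this illustration (it is not part of base R, zoo, xts or dplyr); it mimics the "position i gets the value from position i - k" convention that dplyr::lag and lag.xts follow for positive k. Calling stats::lag with its namespace prefix is one way to get the core R behaviour even if another package has masked lag.

```r
# Hypothetical helper: shift a plain vector so that position i holds
# the value from position i - k, padding the front with NA.
prev_value <- function(x, k = 1) {
  stopifnot(k >= 0, k <= length(x))
  c(rep(NA, k), head(x, length(x) - k))
}

x <- c(11, 12, 13, 14)
prev_value(x)                         # NA 11 12 13

# Core R convention: a negative k is needed to obtain previous values,
# and the result carries its own time index (here times 2 to 5).
prev_ts <- stats::lag(ts(x), -1)
as.numeric(window(prev_ts, start = 2, end = 4))   # 11 12 13
```

Over the times both versions cover (2 to 4), the values agree; the difference lies in the sign of k and in whether a time index is carried along.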

Upvotes: 2

Nelewout

Reputation: 6564

Please consult the manual first:

Description

Compute a lagged version of a time series, shifting the time base back by a given number of observations.

Default S3 method:

lag(x, k = 1, ...)

Arguments

x A vector or matrix or univariate or multivariate time series

k The number of lags (in units of observations).

So, lag does not return a lagged value. It returns the entire lagged time series, shifted back by some k. This is not something a simple lm can work with, and indeed not what you want to use. This, however, does work for me:

library(zoo)

x <- zoo(c(11, 12, 13, 14))
y <- c(1, 2.3, 3.8, 4.2)

lagged <- lag(x, -1)
lagged <- c(lagged, c=0) # first lag is defined as zero

model <- lm(y ~ x + lagged)
summary(model)

Returns:

Call:
lm(formula = y ~ x + lagged)

Residuals:
         1          2          3          4 
-8.327e-17 -1.833e-01  3.667e-01 -1.833e-01 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.86333    4.20149  -2.110    0.282
x            0.89667    0.38456   2.332    0.258
lagged       0.05333    0.08199   0.650    0.633

Residual standard error: 0.4491 on 1 degrees of freedom
Multiple R-squared:  0.9687,    Adjusted R-squared:  0.9062 
F-statistic: 15.49 on 2 and 1 DF,  p-value: 0.1769

Upvotes: 1
