user2715182

Reputation: 723

Fit a linear regression model for a variable that depends on its past values in R

I am working on a model that is similar to time series prediction.

I have to fit a linear regression model to a target variable (TV) that depends on two other variables (X and Y) and also on its own past values.

Basically the model looks like this:

TV(t) ~ X(t) + Y(t) + TV(t-1) + TV(t-2) + TV(t-3)

I got stuck trying to convert this into R code:

model <- lm(modeldata$TV ~ modeldata$X  +modeldata$Y+ ??)

How do I write the R code to fit this kind of model?

Upvotes: 0

Views: 982

Answers (2)

crocodile

Reputation: 129

One possible solution is to use Hadley Wickham's dplyr package and its lag() function. Here is a complete example. We first create a simple modeldata data frame.

modeldata <- data.frame(X=1:10, Y=1:10, TV=1:10)
modeldata
    X  Y TV
1   1  1  1
2   2  2  2
3   3  3  3
4   4  4  4
5   5  5  5
6   6  6  6
7   7  7  7
8   8  8  8
9   9  9  9
10 10 10 10

Then we load the dplyr package and use its mutate() function to create new columns in the data frame with lag().

library(dplyr)
modeldata <- mutate(modeldata, TVm1 = lag(TV,1), TVm2 = lag(TV,2), TVm3 = lag(TV, 3))
modeldata
    X  Y TV TVm1 TVm2 TVm3
1   1  1  1   NA   NA   NA
2   2  2  2    1   NA   NA
3   3  3  3    2    1   NA
4   4  4  4    3    2    1
5   5  5  5    4    3    2
6   6  6  6    5    4    3
7   7  7  7    6    5    4
8   8  8  8    7    6    5
9   9  9  9    8    7    6
10 10 10 10    9    8    7

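Lastly, we pass all variables from the data frame (using the ~ . notation) to the lm() function. By default, lm() drops the rows that contain NA, so the first three rows are not used in the fit.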

model <- lm(TV ~ ., data = modeldata)
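Note that in the toy data above X, Y and TV are identical, so the fit is exact and lm() reports NA for the redundant coefficients. The following is only a sketch on simulated data to show the same steps with an explicit formula; the data frame sim, the seed, and all simulation coefficients are made up for illustration and are not part of the original question.

```r
library(dplyr)

set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 200
sim <- data.frame(X = rnorm(n), Y = rnorm(n))

# Simulate TV from an assumed process with three lags of itself;
# every coefficient below is invented for this illustration.
tv <- numeric(n)
for (t in 4:n) {
  tv[t] <- 0.5 * sim$X[t] - 0.3 * sim$Y[t] +
    0.4 * tv[t - 1] + 0.2 * tv[t - 2] + 0.1 * tv[t - 3] +
    rnorm(1, sd = 0.1)
}
sim$TV <- tv

# Add the three lagged columns, as in the mutate() call above.
# dplyr::lag is spelled out to avoid confusion with stats::lag.
sim <- mutate(sim,
              TVm1 = dplyr::lag(TV, 1),
              TVm2 = dplyr::lag(TV, 2),
              TVm3 = dplyr::lag(TV, 3))

# Explicit formula instead of TV ~ .; rows with NA lags are dropped by default.
fit <- lm(TV ~ X + Y + TVm1 + TVm2 + TVm3, data = sim)
coef(fit)
```

Writing the formula out explicitly keeps the model stable if further columns are later added to the data frame, whereas TV ~ . would silently pick them up as predictors.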

To obtain predictions from this model, we have to prepare the test set in the same way.

testdata <- data.frame(X = 11:15, Y = 11:15, TV = 11:15)
testdata <- mutate(testdata, TVm1 = lag(TV,1), TVm2 = lag(TV,2), TVm3 = lag(TV, 3))
predict(model, newdata = testdata)

In this case we can obtain predictions only for the last two observations in testdata (rows 4 and 5, where TV is 14 and 15). For the earlier observations, not all lagged values can be computed from testdata alone, so predict() returns NA for them.
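One way around this limitation, assuming the test observations follow directly after the training observations in time, is to borrow the last three training rows before computing the lags. This is a sketch only; the names train, test, combined, test_lagged and k are hypothetical and chosen for this illustration.

```r
library(dplyr)

train <- data.frame(X = 1:10,  Y = 1:10,  TV = 1:10)
test  <- data.frame(X = 11:15, Y = 11:15, TV = 11:15)

k <- 3  # number of lags used in the model

# Prepend the last k training rows so that every test row has a full lag history.
combined <- bind_rows(tail(train, k), test)
combined <- mutate(combined,
                   TVm1 = dplyr::lag(TV, 1),
                   TVm2 = dplyr::lag(TV, 2),
                   TVm3 = dplyr::lag(TV, 3))

# Drop the k borrowed rows again; the remaining rows are the test set
# with no missing lag values.
test_lagged <- combined[-seq_len(k), ]
test_lagged
```

The resulting test_lagged can then be passed as newdata to predict(). This still relies on the observed TV values in the test period, so it gives one-step-ahead predictions rather than a multi-step forecast.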

Of course, we assume that the rows are time series data ordered in time. Otherwise, it is not possible to fit and use such a model.

Upvotes: 2

IRTFM

Reputation: 263332

You need to build the proper dataset before passing it to lm(). Some lag functions exist: one in the dplyr package and a different one, in base R, for use with time series objects. A quick way to create lagged versions of TV is embed():

 laggedVar <- embed(Var, 4)

E.g.

> embed(1:10, 4)
     [,1] [,2] [,3] [,4]
[1,]    4    3    2    1
[2,]    5    4    3    2
[3,]    6    5    4    3
[4,]    7    6    5    4
[5,]    8    7    6    5
[6,]    9    8    7    6
[7,]   10    9    8    7
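As a rough sketch of how the embed() output could feed into lm(): column 1 holds the current value and the remaining columns hold the lags, and the first three observations are dropped, so X and Y have to be trimmed to match. The simulated series and the names k, E, lagged, TVm1, TVm2 and TVm3 are assumptions made for this illustration.

```r
set.seed(42)                     # arbitrary seed, for reproducibility only
n  <- 50
X  <- rnorm(n)
Y  <- rnorm(n)
TV <- cumsum(rnorm(n))           # made-up series, for illustration only

k <- 3                           # number of lags
E <- embed(TV, k + 1)            # column 1 is TV(t), columns 2..4 are TV(t-1)..TV(t-3)

lagged <- data.frame(
  TV   = E[, 1],
  TVm1 = E[, 2],
  TVm2 = E[, 3],
  TVm3 = E[, 4],
  X    = X[(k + 1):n],           # embed() drops the first k rows, so align X and Y
  Y    = Y[(k + 1):n]
)

fit <- lm(TV ~ X + Y + TVm1 + TVm2 + TVm3, data = lagged)
coef(fit)
```

Because embed() returns only complete rows, the resulting data frame contains no NA values and every row is used in the fit.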

You might also look at regression methods designed for panel data, which can be expected to show some degree of autocorrelation.

Upvotes: 0
