Reputation: 21
I'm trying to figure out how to incorporate lagged dependent variables into statsmodels or scikit-learn to forecast time series with AR terms, but I cannot seem to find a solution.
The general linear equation looks something like this:
y = B1*y(t-1) + B2*x1(t) + B3*x2(t-3) + e
I know I can use pd.Series.shift(t) to create lagged variables and add them to the model to estimate parameters, but how can I get a prediction when the code does not know which variable is a lagged dependent variable?
In SAS's Proc Autoreg, you can designate which variable is a lagged dependent variable and it will forecast accordingly, but there seem to be no options like that in Python.
Any help would be greatly appreciated and thank you in advance.
Upvotes: 2
Views: 3127
Reputation: 375
Since you already mentioned statsmodels
in your tags, you may want to take a look at statsmodels' ARIMA, i.e.:
from statsmodels.tsa.arima.model import ARIMA  # statsmodels >= 0.12; the old statsmodels.tsa.arima_model is deprecated

model = ARIMA(endog=y, order=(2, 0, 0))  # y is your time series; p=2, d=0, q=0 for AR(2)
fit = model.fit()
fit.summary()
But as you mentioned, you can also create the lagged variables manually the way you described (I used some random data):
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['random_variable'] = np.random.randint(0, 10, len(df))
df['y'] = np.random.rand(len(df))
df.index = df['date']
df = df[['y', 'value', 'random_variable']]
df.columns = ['y', 'x1', 'x2']
shifts = 3
for variable in df.columns.values:
    for t in range(1, shifts + 1):
        df[f'{variable} AR({t})'] = df.shift(t)[variable]
df = df.dropna()
>>> df.head()
y x1 x2 ... x2 AR(1) x2 AR(2) x2 AR(3)
date ...
1991-10-01 0.715115 3.611003 7 ... 5.0 7.0 7.0
1991-11-01 0.202662 3.565869 3 ... 7.0 5.0 7.0
1991-12-01 0.121624 4.306371 7 ... 3.0 7.0 5.0
1992-01-01 0.043412 5.088335 6 ... 7.0 3.0 7.0
1992-02-01 0.853334 2.814520 2 ... 6.0 7.0 3.0
[5 rows x 12 columns]
I'm using the model you describe in your post:
model = sm.OLS(df['y'], df[['y AR(1)', 'x1', 'x2 AR(3)']])
fit = model.fit()
>>> fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.696
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 150.8
Date: Tue, 08 Oct 2019 Prob (F-statistic): 6.93e-51
Time: 17:51:20 Log-Likelihood: -53.357
No. Observations: 201 AIC: 112.7
Df Residuals: 198 BIC: 122.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
y AR(1) 0.2972 0.072 4.142 0.000 0.156 0.439
x1 0.0211 0.003 6.261 0.000 0.014 0.028
x2 AR(3) 0.0161 0.007 2.264 0.025 0.002 0.030
==============================================================================
Omnibus: 2.115 Durbin-Watson: 2.277
Prob(Omnibus): 0.347 Jarque-Bera (JB): 1.712
Skew: 0.064 Prob(JB): 0.425
Kurtosis: 2.567 Cond. No. 41.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Hope this helps you get started.
Upvotes: 1