Reputation: 21
I'm trying to figure out how to incorporate lagged dependent variables into statsmodels or scikit-learn to forecast time series with AR terms, but I cannot seem to find a solution.
The general linear equation looks something like this:
y = B1*y(t-1) + B2*x1(t) + B3*x2(t-3) + e
I know I can use pd.Series.shift(t) to create lagged variables and add them to the model to estimate parameters, but how can I get a prediction when the code does not know which variable is a lagged dependent variable?
In SAS's Proc Autoreg, you can designate which variable is a lagged dependent variable and it will forecast accordingly, but there seem to be no options like that in Python.
Any help would be greatly appreciated and thank you in advance.
Upvotes: 2
Views: 3127
Reputation: 375
Since you already mentioned statsmodels
in your tags, you may want to take a look at statsmodels' ARIMA, i.e.:
from statsmodels.tsa.arima.model import ARIMA  # statsmodels >= 0.12; the old statsmodels.tsa.arima_model is deprecated

model = ARIMA(endog=y, order=(2, 0, 0))  # y is your time series; p=2, d=0, q=0 for AR(2)
fit = model.fit()
fit.summary()
But as you mentioned, you can also create the lagged variables manually the way you described (I used some random data):
import numpy as np
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'])
df['random_variable'] = np.random.randint(0, 10, len(df))
df['y'] = np.random.rand(len(df))
df.index = df['date']
df = df[['y', 'value', 'random_variable']]
df.columns = ['y', 'x1', 'x2']
shifts = 3
for variable in df.columns.values:
    for t in range(1, shifts + 1):
        df[f'{variable} AR({t})'] = df.shift(t)[variable]
df = df.dropna()
>>> df.head()
y x1 x2 ... x2 AR(1) x2 AR(2) x2 AR(3)
date ...
1991-10-01 0.715115 3.611003 7 ... 5.0 7.0 7.0
1991-11-01 0.202662 3.565869 3 ... 7.0 5.0 7.0
1991-12-01 0.121624 4.306371 7 ... 3.0 7.0 5.0
1992-01-01 0.043412 5.088335 6 ... 7.0 3.0 7.0
1992-02-01 0.853334 2.814520 2 ... 6.0 7.0 3.0
[5 rows x 12 columns]
I'm using the model you describe in your post:
model = sm.OLS(df['y'], df[['y AR(1)', 'x1', 'x2 AR(3)']])
fit = model.fit()
>>> fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.696
Model: OLS Adj. R-squared: 0.691
Method: Least Squares F-statistic: 150.8
Date: Tue, 08 Oct 2019 Prob (F-statistic): 6.93e-51
Time: 17:51:20 Log-Likelihood: -53.357
No. Observations: 201 AIC: 112.7
Df Residuals: 198 BIC: 122.6
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
y AR(1) 0.2972 0.072 4.142 0.000 0.156 0.439
x1 0.0211 0.003 6.261 0.000 0.014 0.028
x2 AR(3) 0.0161 0.007 2.264 0.025 0.002 0.030
==============================================================================
Omnibus: 2.115 Durbin-Watson: 2.277
Prob(Omnibus): 0.347 Jarque-Bera (JB): 1.712
Skew: 0.064 Prob(JB): 0.425
Kurtosis: 2.567 Cond. No. 41.5
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Hope this helps you get started.
Upvotes: 1