sdave

Reputation: 568

Issue with combining regression model and ARIMA errors in time series forecasting

I am working on a time series forecasting problem using a combination of a regression model and ARIMA errors. The regression model is implemented using the sm.OLS function from the statsmodels library, and the ARIMA model is fitted to the residuals obtained from the regression model.

Explanation of Predictors:

  1. sweek: Represents the statistical week number of the year.
  2. smonth: Represents the statistical month number.
  3. syear: Represents the statistical year.
  4. cost: Represents the cost/marketing spend associated with the particular time period.

Although the code below runs successfully, the results are not satisfactory. I suspect that the ARIMA order I chose, (1, 0, 0), is not optimal for my data. I would like to perform a hyperparameter search to find the best values of p, d, and q for the ARIMA model.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Step 1: Prepare the data
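# df is assumed to be an already-loaded DataFrame containing the
# columns sweek, smonth, syear, cost and visits
df = df.copy()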

# Remove rows with empty values
df = df.dropna()

# Step 2: Feature engineering (if required)
# If you need to create additional features, you can do so in this step.

# Step 3: Split the data into training and testing sets
train_size = int(len(df) * 0.8)  # 80% of the data for training
train_data = df[:train_size]
test_data = df[train_size:]

# Step 4: Regression analysis
# Define the predictors (independent variables)
predictors = ['sweek', 'smonth', 'syear', 'cost']
X_train = train_data[predictors]
X_train = sm.add_constant(X_train)  # Add a constant term for the intercept
y_train = train_data['visits']

# Fit the regression model
reg_model = sm.OLS(y_train, X_train).fit()

# Step 5: ARIMA errors
# Obtain the residuals (errors) from the regression model
residuals = reg_model.resid

# Fit an ARIMA model to the residuals
arima_model = ARIMA(residuals, order=(1, 0, 0)) 
arima_model_fit = arima_model.fit()

# Step 6: Combine regression model and ARIMA errors
# Obtain the predicted values from the regression model
X_test = test_data[predictors]
X_test = sm.add_constant(X_test)
y_pred_regression = reg_model.predict(X_test)

# Add the ARIMA errors to the regression predictions
# Predict one residual per test row (end is inclusive), so both series have the same length
y_pred_arima = arima_model_fit.predict(start=len(train_data), end=len(train_data) + len(test_data) - 1)
y_pred_combined = y_pred_regression.reset_index(drop=True) + y_pred_arima.reset_index(drop=True)

# Step 7: Evaluate the model
y_test = test_data['visits'].reset_index(drop=True)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_combined)
print("Mean Squared Error:", mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_combined)
print("Mean Absolute Error:", mae)

# Calculate Mean Absolute Percentage Error (MAPE)
mape = np.mean(np.abs((y_test - y_pred_combined) / y_test)) * 100
print("Mean Absolute Percentage Error:", mape)

# Calculate R-squared (R2) score
r2 = r2_score(y_test, y_pred_combined)
print("R-squared Score:", r2)

I would appreciate guidance on how to perform a hyperparameter search to find the best p, d, and q values for the ARIMA model in order to improve the accuracy of my time series forecasting. Additionally, if there are alternative approaches or references that can help me enhance my forecasting results, I would be grateful for any suggestions.

Upvotes: 2

Views: 410

Answers (1)

Michael Grogan

Reputation: 1016

It seems that you are fitting the ARIMA model to the residuals of the regression model, as opposed to simply modelling visits directly with ARIMA.

I would try modelling visits directly first, as there is a risk that the explanatory variables do not adequately account for the variation in your time series. If visits shows clear seasonality and a trend, you may be able to forecast visits over time in its own right.

If the regression model is not doing a good job of forecasting visits, then neither will an ARIMA model fitted to its residuals, so I would not recommend this approach.

To understand your data better, I would suggest generating ACF and PACF plots to help determine an appropriate order for your ARIMA model. You might find this guide useful.

Upvotes: 0
