Reputation: 731
I'm trying to wrap my head around how to implement an ARIMA model to produce (arguably) simple forecasts. Essentially, I want to forecast this year's bookings up until the end of the year and export them as a csv, looking something like this:
date bookings
2017-01-01 438
2017-01-02 167
...
2017-12-31 45
2018-01-01 748
...
2018-11-29 223
2018-11-30 98
...
2018-12-30 73
2018-12-31 100
where anything after today (28/11/18) is a forecast.
What I've tried to do:
This gives me my dataset, which is basically two columns: the date (daily, for the whole of 2017) and the bookings:
import pandas as pd
import statsmodels.api as sm
# ARIMA is used in the modelling loop, so this import must be active.
# This is the pre-0.13 statsmodels path; newer releases expose ARIMA
# from statsmodels.tsa.arima.model instead.
from statsmodels.tsa.arima_model import ARIMA
# from sklearn.metrics import mean_squared_error
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
# Read the two columns and use the parsed dates as the index
df = pd.read_csv('data.csv', names=["date", "bookings"], index_col=0)
df.index = pd.to_datetime(df.index)
This is the 'modelling' bit:
X = df.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
for t in range(len(test)):
    model = ARIMA(history, order=(1,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    # print('predicted=%f, expected=%f' % (yhat, obs))
#error = mean_squared_error(test, predictions)
#print(error)
#print('Test MSE: %.3f' % error)
# plot
plt.figure(num=None, figsize=(15, 8))
plt.plot(test)
plt.plot(predictions, color='red')
plt.show()
Exporting results to a csv:
df_forecast = pd.DataFrame(predictions)
df_test = pd.DataFrame(test)
result = pd.merge(df_test, df_forecast, left_index=True, right_index=True)
result.rename(columns = {'0_x': 'Test', '0_y': 'Forecast'}, inplace=True)
The trouble I'm having is:
What I think I need to do:
The how-to and the why are the problem I'm having. Any help would be much appreciated.
Upvotes: 1
Views: 1186
Reputation: 88236
Here are some thoughts:
Yes, that is correct. The idea is the same as for any machine learning model: the data is split into train and test sets, a model is fit on the training data, and the test set is used to compare the model's predictions with the real data using some error metric. However, as you are dealing with time series data, the train/test split must respect the time order, as yours already does.
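To make the split-and-score idea concrete, here is a minimal sketch in plain Python. The booking counts are made up, the helper names `chronological_split` and `mean_squared_error` are hypothetical, and the "repeat the last training value" baseline is only a stand-in for a real model, not ARIMA itself.

```python
def chronological_split(values, train_fraction=0.66):
    """Split a time-ordered sequence without shuffling: early part trains, late part tests."""
    size = int(len(values) * train_fraction)
    return values[:size], values[size:]


def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up daily booking counts, oldest first (illustrative only).
bookings = [438, 167, 210, 305, 120, 98, 260, 330, 45, 150]
train, test = chronological_split(bookings)

# Naive baseline: predict that every future day equals the last training value.
baseline = [train[-1]] * len(test)
baseline_mse = mean_squared_error(test, baseline)
print(baseline_mse)
```

A model is only worth keeping if its test error beats a simple baseline like this one on the same held-out period.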
Do you actually have a csv with the 2018 data? To split it into train/test, do the same as you do for the 2017 data, i.e. keep everything up to some size as train and leave the rest to test your predictions: train, test = X[0:size], X[size:len(X)]. However, if what you want is a prediction from today's date onwards, why not use all the historical data as input to the model and use that to forecast?
What I think I need to do
Why would you want to split it? Simply feed your ARIMA model all your data as a single time series, i.e. append the two years of data, and use the last size samples as test. Keep in mind that the estimate generally gets better as the sample size grows. Once you've validated the model's performance, use it to predict from today onwards.
Upvotes: 2