Reputation: 115
I wasn't able to find the information I was looking for, so I will post my question here. I am just venturing into machine learning. I did my first multiple regression for a time series using the scikit-learn library. My code is shown below:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

X = df[feature_cols]
y = df[['scheduled_amount']]
# Keep the datetime index as a column so each fold can be mapped back to dates.
index = y.reset_index().drop('scheduled_amount', axis=1)

linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)

for train_index, test_index in tscv.split(X):
    train = index.iloc[train_index]
    train_start, train_end = train.iloc[0, 0], train.iloc[-1, 0]
    test = index.iloc[test_index]
    test_start, test_end = test.iloc[0, 0], test.iloc[-1, 0]
    X_train, X_test = X[train_start:train_end], X[test_start:test_end]
    y_train, y_test = y[train_start:train_end], y[test_start:test_end]
    linreg.fit(X_train, y_train)
    y_predict = linreg.predict(X_test)
    # Note: score() returns R^2, not the residual sum of squares.
    print('R^2: ' + str(linreg.score(X_test, y_test)))
    y_test = y_test.copy()  # avoid SettingWithCopyWarning when adding a column
    y_test['predicted_amount'] = y_predict
    y_test.plot()
Note that my data is time series data and I want to keep the datetime index in my DataFrame when fitting my model. I am using TimeSeriesSplit for cross-validation, but I still don't really understand cross-validation. First, is there a need for cross-validation with a time series dataset? Second, should I use the coefficients (coef_) from the last fold, or the average across all folds, for my future predictions?
Upvotes: 1
Views: 1861
Reputation: 843
Yes, there is a need for cross-validation with a time series dataset. Basically, you need to ensure your model does not overfit the data you currently have and is able to capture past seasonal changes, so you can have some confidence it will do the same in the future. This method is also used to choose model hyperparameters (e.g. alpha in a Ridge regression).
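As a rough illustration, here is a minimal sketch (assuming the same df, feature_cols, X and y as in the question; the alpha grid values are just placeholders) of using TimeSeriesSplit together with GridSearchCV to pick Ridge's alpha:

from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

# Time-ordered splits: each fold trains on the past and validates on the future.
tscv = TimeSeriesSplit(n_splits=11)

# Search over a small grid of regularization strengths.
param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=tscv, scoring='r2')
search.fit(X, y)

print(search.best_params_)  # alpha chosen by time-series cross-validation
print(search.best_score_)   # mean validation R^2 for that alpha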
In order to make future predictions, you should refit your regressor on the whole dataset with the best hyperparameters or, as @Marcus V. mentioned in the comments, it may be best to train it only on the most recent data.
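A minimal sketch of that final refit, reusing the hypothetical search object from the example above:

# Refit on all available data with the chosen hyperparameters
# before making out-of-sample forecasts.
final_model = Ridge(alpha=search.best_params_['alpha'])
final_model.fit(X, y)

# X_future is a hypothetical DataFrame of future feature values
# with the same columns as X.
# future_predictions = final_model.predict(X_future)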
Upvotes: 1