Reputation: 35
I am getting a little confused about the notion of training, validation and test sets and what exactly each should be used for.
My understanding is that we split our data (for example 70% train, 10% validation, 20% test) and take the following steps: fit the model on the training set, evaluate it on the validation set while tuning, and finally evaluate it once on the test set.
In Python this would (roughly) look like the following, using statsmodels' ARIMA model as an example:
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error as mse

# fit the model on the training set only
model = ARIMA(endog=training_set, order=(x, x, x))
arima_model = model.fit()
# evaluate on the validation set
validation_forecast = arima_model.forecast(steps=len(validation_set))
print('validation mse:', mse(validation_set, validation_forecast))
# evaluate once on the test set
test_forecast = arima_model.forecast(steps=len(test_set))
print('test mse:', mse(test_set, test_forecast))
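For reference, the 70/10/20 split itself could be made with a simple chronological slice, something like this (y is only a placeholder for the full series, which I have not shown above):

import numpy as np

y = np.arange(100, dtype=float)   # placeholder series; replace with the real data
n = len(y)
train_end = int(0.7 * n)          # first 70% for training
val_end = int(0.8 * n)            # next 10% for validation, last 20% for testing

training_set = y[:train_end]
validation_set = y[train_end:val_end]
test_set = y[val_end:]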
Have I understood the process of using training, validation and test sets correctly?
Also, how would I incorporate cross-validation in this scenario?
Upvotes: 0
Views: 669
Reputation: 1
I think you have mostly understood how to use them. Generally, the three sets are used as follows:
The training set is used to train the model, i.e. to extract the patterns and relationships in the data. Training is performed by adjusting the weights and parameters of the model to fit the input data.
The validation set is used to fine-tune the model by optimizing the hyper-parameters, such as the learning rate (a short sketch of this is given after these three points). It is consulted regularly during training. Therefore, the validation set does affect the final model indirectly, but it is not as crucial as the training set.
The test set is used only once, at the end of training, to test the model on a completely new set of data that was never seen during training. It is often a public dataset that can be used to compare your performance to other models. It should contain enough variety in the data so that the model is thoroughly tested on various cases.
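As a concrete illustration of the second point, here is a minimal sketch of how the validation set could drive the choice of the ARIMA order in your example (the candidate orders are arbitrary assumptions, and training_set / validation_set / test_set refer to the splits from your post):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# try a few candidate orders (hyper-parameters) and keep the one with the lowest validation MSE
candidate_orders = [(1, 0, 0), (1, 1, 1), (2, 1, 2)]
best_order, best_mse = None, float('inf')
for order in candidate_orders:
    fitted = ARIMA(endog=training_set, order=order).fit()
    forecast = fitted.forecast(steps=len(validation_set))
    val_mse = mean_squared_error(validation_set, forecast)
    if val_mse < best_mse:
        best_order, best_mse = order, val_mse

# refit with the chosen order on train + validation, then touch the test set only once
final_model = ARIMA(endog=np.concatenate([training_set, validation_set]), order=best_order).fit()
test_forecast = final_model.forecast(steps=len(test_set))
print('test mse:', mean_squared_error(test_set, test_forecast))

Refitting on train + validation before the final test is a common choice but optional; the key point is that the test MSE is computed only once, after all tuning is done.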
I am not sure about forecasting over the length of the validation and test periods. Cross-validation consists of partitioning the data into multiple folds and then training and evaluating the model successively, with each fold serving once as the held-out set.
With sklearn, it can be done like so:
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of a linear SVM on the iris dataset
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
This will both fit the model and compute its score for each of the 5 consecutive cross-validation partitions.
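For your ARIMA scenario specifically, plain k-fold ignores time order. A minimal sketch of one alternative, using scikit-learn's TimeSeriesSplit to generate expanding training windows and refitting the statsmodels ARIMA on each one (the series y and the order (1, 1, 1) are placeholders, not values from your post):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

y = np.random.default_rng(0).standard_normal(200).cumsum()  # placeholder series

tscv = TimeSeriesSplit(n_splits=5)
fold_mses = []
for train_idx, val_idx in tscv.split(y):
    # each fold trains on an expanding window and validates on the block that follows it
    fitted = ARIMA(endog=y[train_idx], order=(1, 1, 1)).fit()
    forecast = fitted.forecast(steps=len(val_idx))
    fold_mses.append(mean_squared_error(y[val_idx], forecast))

print('cv mse per fold:', fold_mses)
print('mean cv mse:', np.mean(fold_mses))

Each fold refits the model on a longer history and scores it on the block that comes after it, which respects the temporal ordering that a model like ARIMA assumes.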
Upvotes: 0