mbih

Reputation: 35

train, validation, test set in relation to evaluation metrics

I am getting a little confused about the notion of training, validation and test set and what exactly they should be used for.

My understanding is that we make a split in our data (for example 70% train, 10% validation, 20% test) and take the following steps:

  1. Initialize the model (choose the model we want to use)
  2. Train the model with the training data.
  3. Generate predictions over the validation period using the model.
  4. Evaluate metrics like MSE or RMSE using the validation data.
  5. Tune the hyperparameters of the model to optimize the chosen metric (whichever was chosen in step 4).
  6. With the optimal model, generate predictions over the test period.
  7. Lastly, evaluate the chosen metrics again using the test data. This will be the actual performance of the model if it were to be used in production.

In Python this would (roughly) look like the following (using statsmodels' ARIMA as the model, since ARIMA is not part of sklearn, and sklearn's mean_squared_error as the metric):

  1. model = ARIMA(endog=training_set, order=(p, d, q))
  2. arima_model = model.fit()
  3. validation_forecast = arima_model.forecast(steps=len(validation_set))
  4. print('mse:', mean_squared_error(validation_set, validation_forecast))
  5. Tune the model if necessary
  6. test_forecast = arima_model.forecast(steps=len(test_set))
  7. print('mse:', mean_squared_error(test_set, test_forecast))

Have I understood the process of training, validation and test sets correctly?

Also, how would I incorporate cross-validation in this scenario?

Upvotes: 0

Views: 669

Answers (1)

Dark Shadow

Reputation: 1

I think you have mostly understood how to use them correctly. Generally, the three sets are used as follows:

The training set is used to train the model, i.e. to extract the patterns and relationships in the data. Training is performed by adjusting the weights and parameters of the model to fit the input data.

The validation set is used to fine-tune the model by optimizing hyper-parameters such as the learning rate. It is used regularly during training; therefore the validation set does affect the final model indirectly, although it is not as crucial as the training set.

The test set is used only once, at the end of training, to evaluate the model on a completely new set of data that was never seen during training. It is often a public dataset that can be used to compare your performance to other models. It should contain enough variety so that the model is thoroughly tested on various cases.
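As a minimal sketch of how you might produce the 70/10/20 split from your question (this is my own illustration, using sklearn's train_test_split on a placeholder array, with shuffle=False so the chronological order of a time series is preserved):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100)  # placeholder series; substitute your own data

# First carve off 70% for training; shuffle=False preserves time order
train_set, temp_set = train_test_split(data, test_size=0.30, shuffle=False)
# Split the remaining 30% into 10% validation and 20% test (of the original)
validation_set, test_set = train_test_split(temp_set, test_size=2/3, shuffle=False)

print(len(train_set), len(validation_set), len(test_set))  # 70 10 20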

I am not sure about the prediction of the length of the validation and test period. Cross-validation consists of partitioning the data into multiple folds and then training and evaluating successively on those folds, as such:

[image: k-fold cross-validation diagram]

With sklearn, it can be done like so:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

# Load a toy dataset and define the model to evaluate
X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1, random_state=42)

# Fit and score the model on 5 cross-validation folds
scores = cross_val_score(clf, X, y, cv=5)

This will both fit the model and compute its score on 5 consecutive cross-validation partitions.
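One caveat: your data is a time series, and a plain k-fold split like the one above can leak future observations into the training folds. A minimal sketch of time-order-preserving cross-validation, assuming sklearn's TimeSeriesSplit and statsmodels' ARIMA, where the synthetic series and order=(1, 1, 1) are placeholders:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.arima.model import ARIMA

series = np.random.randn(200).cumsum()  # placeholder series; use your own data

# Each split trains on an initial segment and validates on the segment
# immediately after it, so the temporal order is never violated
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, val_idx in tscv.split(series):
    fitted = ARIMA(endog=series[train_idx], order=(1, 1, 1)).fit()  # order is a placeholder
    forecast = fitted.forecast(steps=len(val_idx))
    scores.append(mean_squared_error(series[val_idx], forecast))

print('mean cv mse:', np.mean(scores))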

Upvotes: 0
