Marco

Reputation: 1235

Interpretation of train-validation loss of a Neural Network

I have trained an LSTM model for time series forecasting, using early stopping with a patience of 150 epochs and a dropout of 0.2. This is the plot of the training and validation loss:

[plot: training and validation loss per epoch]

Early stopping ended the training after 650 epochs and saved the best weights from around epoch 460, where the validation loss was lowest.
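For reference, a minimal sketch of a setup like the one described above. Only the dropout rate (0.2) and the patience (150) come from the question; the architecture, optimizer, and data shapes are placeholder assumptions:

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dropout, Dense
    from tensorflow.keras.callbacks import EarlyStopping

    # Dummy data standing in for the real hourly series (placeholders):
    # 24-hour input windows with a single feature.
    X_train = np.random.rand(800, 24, 1); y_train = np.random.rand(800, 1)
    X_val   = np.random.rand(200, 24, 1); y_val   = np.random.rand(200, 1)

    model = Sequential([
        LSTM(64, input_shape=(24, 1)),
        Dropout(0.2),                  # dropout rate from the question
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # restore_best_weights=True reloads the weights from the epoch with
    # the lowest validation loss (around epoch 460 in the plot above).
    early_stop = EarlyStopping(monitor="val_loss", patience=150,
                               restore_best_weights=True)

    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=2000, verbose=0, callbacks=[early_stop])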

My question is: is it normal for the training loss to stay above the validation loss? I know that the opposite (validation loss above training loss) would be a sign of overfitting, but what about this case?

EDIT: My dataset is a time series with hourly frequency, composed of 35,000 instances. I split the data into 80% training and 20% validation, keeping the temporal order: for example, the training set contains the data up to the beginning of 2017 and the validation set the data from 2017 onward. I created this plot by averaging the data over 15-day windows:

[plot: data averaged over 15-day windows]
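A minimal sketch of that chronological split (`data` is a placeholder for the hourly series):

    import numpy as np

    data = np.arange(35000, dtype=float)  # placeholder for the hourly series

    # Chronological 80/20 split: no shuffling, so every validation sample
    # is strictly later in time than every training sample.
    split = int(len(data) * 0.8)
    train, val = data[:split], data[split:]
    print(len(train), len(val))  # 28000 7000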

So maybe the reason is, as you said, that the validation data have an easier pattern. How can I solve this problem?

Upvotes: 0

Views: 913

Answers (2)

kerastf

Reputation: 509

Usually the opposite is true. But since you are using dropout, it is common for the validation loss to be lower than the training loss: dropout is active while the training loss is computed but disabled during validation. And, as others have suggested, try k-fold cross-validation.
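One way to check how much of the gap dropout explains (a sketch, assuming a trained Keras `model`, the `history` returned by `model.fit`, and the training arrays):

    # Keras averages the training loss over batches *with dropout active*,
    # while the validation loss is computed with dropout disabled.
    # Re-evaluating the training set in inference mode removes that bias.
    train_loss_no_dropout = model.evaluate(X_train, y_train, verbose=0)
    print("reported train loss (dropout on):", history.history["loss"][-1])
    print("train loss with dropout off    :", train_loss_no_dropout)
    # If the second number is close to the validation loss, dropout is
    # the main reason the validation loss looks lower.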

Upvotes: 1

Van

Reputation: 3767

In most cases, the validation loss should be higher than the training loss because the labels in the training set are directly accessible to the model. In fact, a good habit when training a new network is to train on a small subset of the data and check whether the training loss converges to 0 (i.e., the model fully overfits the subset). If it does not, the model lacks the capacity to memorize even that small amount of data.
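A hedged sketch of that sanity check, assuming a Keras-style `model` and training arrays like those above (names are placeholders):

    # Take a tiny subset and try to memorize it completely.
    X_small, y_small = X_train[:100], y_train[:100]
    model.fit(X_small, y_small, epochs=1000, verbose=0)

    loss = model.evaluate(X_small, y_small, verbose=0)
    print("loss on memorized subset:", loss)  # should approach 0
    # If the loss plateaus well above 0, the model (or the data pipeline)
    # cannot even memorize 100 samples and should be debugged first.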

Let's go back to your problem. Validation loss lower than training loss does happen, but it is possibly caused not by your model but by how you split the data. Suppose there are two types of patterns (A and B) in the dataset, and you split it so that the training set contains both A and B while the small validation set contains only B. If B is easier to recognize, you will get a higher training loss.

As a more extreme example, suppose pattern A is almost impossible to recognize but makes up only 1% of the dataset, while the model can recognize all of pattern B. If the validation set happens to contain only pattern B, the validation loss will be smaller than the training loss.

As alex mentioned, k-fold cross-validation is a good way to make sure every sample is used as both validation and training data; for a time series, use an order-preserving variant rather than a shuffled split. Also, printing the confusion matrix to check that all labels are reasonably balanced is another method to try (for classification tasks).
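Since the data here is a time series, a shuffled k-fold would leak future information into training; scikit-learn's `TimeSeriesSplit` is an order-preserving alternative (a sketch, with `data` as a placeholder):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    data = np.arange(35000)  # placeholder for the hourly series

    # Each validation fold lies strictly after its training fold in time.
    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, val_idx) in enumerate(tscv.split(data)):
        print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
        # fit and evaluate the model on data[train_idx] / data[val_idx] here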

Upvotes: 2
