Oblomov

Reputation: 9665

Interpreting training trace of a deep neural network: very low training loss and even lower validation loss

I am a bit sceptical about the following log, which I get when training a deep neural network for regression with target values between -1.0 and 1.0, a learning rate of 0.001, and 19200/4800 training/validation samples:

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
cropping2d_1 (Cropping2D)        (None, 138, 320, 3)   0           cropping2d_input_1[0][0]
____________________________________________________________________________________________________
lambda_1 (Lambda)                (None, 66, 200, 3)    0           cropping2d_1[0][0]
____________________________________________________________________________________________________
lambda_2 (Lambda)                (None, 66, 200, 3)    0           lambda_1[0][0]
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 31, 98, 24)    1824        lambda_2[0][0]
____________________________________________________________________________________________________
spatialdropout2d_1 (SpatialDropo (None, 31, 98, 24)    0           convolution2d_1[0][0]
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)  (None, 14, 47, 36)    21636       spatialdropout2d_1[0][0]
____________________________________________________________________________________________________
spatialdropout2d_2 (SpatialDropo (None, 14, 47, 36)    0           convolution2d_2[0][0]
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)  (None, 5, 22, 48)     43248       spatialdropout2d_2[0][0]
____________________________________________________________________________________________________
spatialdropout2d_3 (SpatialDropo (None, 5, 22, 48)     0           convolution2d_3[0][0]
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)  (None, 3, 20, 64)     27712       spatialdropout2d_3[0][0]
____________________________________________________________________________________________________
spatialdropout2d_4 (SpatialDropo (None, 3, 20, 64)     0           convolution2d_4[0][0]
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D)  (None, 1, 18, 64)     36928       spatialdropout2d_4[0][0]
____________________________________________________________________________________________________
spatialdropout2d_5 (SpatialDropo (None, 1, 18, 64)     0           convolution2d_5[0][0]
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 1152)          0           spatialdropout2d_5[0][0]
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 1152)          0           flatten_1[0][0]
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 1152)          0           dropout_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           115300      activation_1[0][0]
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 100)           0           dense_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 50)            5050        dropout_2[0][0]
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 10)            510         dense_2[0][0]
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 10)            0           dense_3[0][0]
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 1)             11          dropout_3[0][0]
====================================================================================================
Total params: 252,219
Trainable params: 252,219
Non-trainable params: 0
____________________________________________________________________________________________________
None
Epoch 1/5
19200/19200 [==============================] - 795s - loss: 0.0292 - val_loss: 0.0128
Epoch 2/5
19200/19200 [==============================] - 754s - loss: 0.0169 - val_loss: 0.0120
Epoch 3/5
19200/19200 [==============================] - 753s - loss: 0.0161 - val_loss: 0.0114
Epoch 4/5
19200/19200 [==============================] - 723s - loss: 0.0154 - val_loss: 0.0100
Epoch 5/5
19200/19200 [==============================] - 1597s - loss: 0.0151 - val_loss: 0.0098

Both training and validation loss decrease, which is good news at first sight. But how can the training loss be so low already during the first epoch? And how can the validation loss be even lower? Is that an indication of a systematic error somewhere in my model or training setup?

Upvotes: 1

Views: 277

Answers (1)

Marcin Możejko

Reputation: 40526

Actually, a validation loss smaller than the training loss is not as rare a phenomenon as one might think. It may occur, for example, when all examples in the validation data are well covered by examples from your training set and your network has simply learnt the actual structure of your dataset.

This happens very often when the structure of your data is not very complex. In fact, the small loss value after the first epoch that surprised you might be a clue that this is what happened in your case.

As for the loss being too small: you haven't specified what your loss function is, but assuming your task is regression, I guess it's MSE. In that case a mean squared error of 0.01 means that the typical (root-mean-square) distance between the true value and the predicted value is sqrt(0.01) = 0.1, which is 5% of the diameter of your value range [-1, 1]. So, is this error actually so small?
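The arithmetic behind that 5% figure can be checked in a few lines (the 0.01 loss value is taken from the training log above):

```python
import math

# An MSE of 0.01 implies a typical (root-mean-square) error of sqrt(0.01) = 0.1.
mse = 0.01
rmse = math.sqrt(mse)  # 0.1

# The target values span [-1, 1], a diameter of 2.0.
diameter = 2.0
relative_error = rmse / diameter  # 0.05, i.e. 5% of the value range

print(f"RMSE: {rmse}, relative error: {relative_error:.0%}")
```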

You also haven't specified the number of batches analysed during one epoch. If the structure of your data is not very complex and the batch size was small, one epoch might have been a sufficient amount of time to learn your data well.

In order to check whether your model is trained well, I advise you to make a correlation plot with y_pred on the X-axis and y_true on the Y-axis. Then you'll actually see how well your model is trained.

EDIT: As Neil mentioned, there might be even more reasons behind the small validation error, such as a poor separation of cases between the training and validation sets. I would also add that, because the 5 epochs took no more than 90 minutes in total, it might be worth checking the model with a classic cross-validation scheme using e.g. 5 folds. This would assure you that your model performs well on your dataset.
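A minimal sketch of that 5-fold scheme, using scikit-learn's `KFold` to generate the splits. The data and the constant-mean predictor here are placeholders so the loop runs end to end; in the real setup you would rebuild and fit your Keras model inside the loop (a hypothetical `build_model()` is assumed in the comment):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical data; substitute your real X (images) and y (targets in [-1, 1]).
X = rng.random((100, 4))
y = rng.uniform(-1.0, 1.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mses = []
for train_idx, val_idx in kf.split(X):
    # Real setup: model = build_model(); model.fit(X[train_idx], y[train_idx], ...)
    # then evaluate on the held-out fold. A trivial constant predictor stands in here.
    prediction = y[train_idx].mean()
    mse = np.mean((y[val_idx] - prediction) ** 2)
    fold_mses.append(mse)

print("per-fold MSE:", [round(float(m), 4) for m in fold_mses])
print("mean MSE:", round(float(np.mean(fold_mses)), 4))
```

If the per-fold scores are all in the same ballpark as the 0.0098 validation loss from the log, the low number is genuine rather than an artifact of one lucky train/validation split.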

Upvotes: 6
