IHaveAQuestion

Reputation: 153

Interpreting training loss/accuracy vs validation loss/accuracy

I have a few questions about interpreting the performance of certain optimizers on MNIST using a LeNet-5 network, and about what the validation loss/accuracy vs. training loss/accuracy graphs tell us exactly. Everything is done in Keras with a standard LeNet-5 network, run for 15 epochs with a batch size of 128.
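Roughly, the model looks like this (a minimal sketch, not my exact script; the layer sizes follow the common LeNet-5 description and are an approximation):

from tensorflow import keras
from tensorflow.keras import layers

# Approximate LeNet-5 for 28x28 MNIST digits.
model = keras.Sequential([
    layers.Conv2D(6, kernel_size=5, padding="same", activation="tanh",
                  input_shape=(28, 28, 1)),
    layers.AveragePooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="tanh"),
    layers.AveragePooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])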

There are two graphs, train acc vs. val acc and train loss vs. val loss. I made four graphs because I ran the training twice: once with validation_split=0.1 and once with validation_data=(x_test, y_test) in the model.fit parameters. Specifically, the difference is shown here:

train = model.fit(x_train, y_train, epochs=15, batch_size=128, validation_data=(x_test,y_test), verbose=1)
train = model.fit(x_train, y_train, epochs=15, batch_size=128, validation_split=0.1, verbose=1)

These are the graphs I produced:

using validation_data=(x_test, y_test):

[graph: training vs. validation accuracy and loss over 15 epochs]

using validation_split=0.1:

[graph: training vs. validation accuracy and loss over 15 epochs]

So my two questions are:

1.) How do I interpret the train acc vs. val acc and the train loss vs. val loss graphs? What exactly do they tell me, and why do different optimizers perform differently (i.e., why are their graphs different as well)?

2.) Why do the graphs change when I use validation_split instead? Which one is the better choice to use?

Upvotes: 4

Views: 7834

Answers (1)

xashru

Reputation: 3590

I will attempt to provide an answer

  1. You can see that towards the end, training accuracy is slightly higher than validation accuracy, and training loss is slightly lower than validation loss. This hints at overfitting, and if you train for more epochs the gap should widen.

    Even if you use the same model with the same optimizer, you will notice slight differences between runs, because the weights are initialized randomly and there is randomness associated with the GPU implementation. You can look here for how to address this issue (a minimal seeding sketch is also shown below the list).

    Different optimizers will usually produce different graphs because they update the model parameters differently. For example, vanilla SGD updates all parameters at a constant rate and at every training step, but if you add momentum the rate depends on previous updates, which usually results in faster convergence. That means you can reach the same accuracy as vanilla SGD in a lower number of iterations (see the optimizer sketch below the list).

  2. The graphs will change because the training data changes when you split it randomly. But for MNIST you should use the standard test split provided with the dataset (see the data-loading sketch below).
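On the run-to-run randomness in point 1, a minimal seeding sketch, assuming a TensorFlow 2.x backend (full GPU determinism may need extra settings depending on the version):

import os
import random

import numpy as np
import tensorflow as tf

# Seed every source of randomness the framework exposes.
seed = 42
os.environ["PYTHONHASHSEED"] = str(seed)
random.seed(seed)          # Python's built-in RNG
np.random.seed(seed)       # NumPy (shuffling, some init helpers)
tf.random.set_seed(seed)   # TensorFlow graph-level and op-level seeds

# Results can still vary slightly on GPU because some CUDA kernels are
# non-deterministic; newer TF versions expose
# tf.config.experimental.enable_op_determinism() to address that.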
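On the optimizer comparison in point 1, a sketch of compiling the same architecture with plain SGD, SGD with momentum, and Adam so the resulting curves can be compared (assumes TensorFlow 2.x; build_lenet5 is a hypothetical helper that returns a fresh LeNet-5, and the learning rates are just illustrative defaults):

from tensorflow import keras

# Each optimizer updates the weights differently, so the loss/accuracy
# curves over the 15 epochs will differ even on identical data.
optimizers = {
    "sgd":          keras.optimizers.SGD(learning_rate=0.01),
    "sgd_momentum": keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "adam":         keras.optimizers.Adam(learning_rate=0.001),
}

histories = {}
for name, opt in optimizers.items():
    m = build_lenet5()   # hypothetical helper: returns an uncompiled LeNet-5
    m.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
    histories[name] = m.fit(x_train, y_train, epochs=15, batch_size=128,
                            validation_split=0.1, verbose=0)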
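And on point 2, a sketch of keeping the standard MNIST test split untouched while still holding out 10% of the training data for validation during training (model is the compiled network from the question):

from tensorflow import keras
import numpy as np

# Load the standard MNIST split shipped with Keras.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis].astype("float32") / 255.0
x_test = x_test[..., np.newaxis].astype("float32") / 255.0

# Validate on 10% of the training data during training...
train = model.fit(x_train, y_train, epochs=15, batch_size=128,
                  validation_split=0.1, verbose=1)

# ...and report final numbers on the untouched standard test split.
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)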

Upvotes: 3
