Manal

Reputation: 75

Keras: model.predict does not match model.evaluate loss

I applied this tutorial https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb (on a different dataset). The tutorial did not compute the mean squared error for the individual outputs, so I added the following line to the comparison function:

    mean_squared_error(signal_true,signal_pred)

However, the loss and MSE computed from the predictions differ from the loss and MSE reported by model.evaluate on the test data. The errors from model.evaluate (loss, MAE, MSE) on the test set:

    [0.013499056920409203, 0.07980187237262726, 0.013792216777801514]

The errors for the individual targets (outputs):

    Target0 0.167851388666284
    Target1 0.6068108648555771
    Target2 0.1710370357827747
    Target3 2.747463225418181
    Target4 1.7965991690103074
    Target5 0.9065426398192563 
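
For reference, this is roughly how I computed the per-target errors, using scikit-learn's mean_squared_error (x_test and y_test are placeholder names here, and the shapes are assumed; the real code follows the tutorial's comparison function):

    # Sketch of the manual per-target check; x_test/y_test are placeholders.
    from sklearn.metrics import mean_squared_error

    y_pred = model.predict(x_test)      # assumed shape: (num_samples, num_targets)
    for i in range(y_test.shape[1]):
        signal_true = y_test[:, i]
        signal_pred = y_pred[:, i]
        print("Target" + str(i), mean_squared_error(signal_true, signal_pred))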

I think it might be a problem in how the model was trained, but I could not find where exactly. I would really appreciate your help.

thanks

Upvotes: 1

Views: 1400

Answers (2)

Shovalt

Reputation: 6776

I had the same problem and found a solution. Hopefully this is the same problem you encountered.

It turns out that model.predict doesn't return predictions in the same order as generator.labels, and that is why the MSE was much larger when I attempted to calculate it manually (using the scikit-learn metric function).

    >>> model.evaluate(valid_generator, return_dict=True)['mean_squared_error']
    13.17293930053711
    >>> mean_squared_error(valid_generator.labels, model.predict(valid_generator)[:,0])
    91.1225401637833

My quick and dirty solution:

    from sklearn.metrics import mean_squared_error  # used below to recompute the MSE
    import numpy as np

    valid_generator.reset()  # Necessary for starting from the first batch
    all_labels = []
    all_pred = []
    for i in range(len(valid_generator)):  # Necessary for avoiding an infinite loop
        x = next(valid_generator)
        pred_i = model.predict(x[0])[:, 0]
        labels_i = x[1]
        all_labels.append(labels_i)
        all_pred.append(pred_i)
        print(np.shape(pred_i), np.shape(labels_i))

    cat_labels = np.concatenate(all_labels)
    cat_pred = np.concatenate(all_pred)

The result:

    >>> mean_squared_error(cat_labels, cat_pred)
    13.172956865002352

This can be done much more elegantly, but it was enough for me to confirm my hypothesis about the problem and regain some sanity.
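
A cleaner fix, sketched below with toy data and an assumed ImageDataGenerator setup (the answer does not show how valid_generator was built), is to create the validation iterator with shuffle=False so the batches come out in a fixed order; then model.predict lines up with the labels and the manual MSE matches model.evaluate:

    import numpy as np
    import tensorflow as tf
    from sklearn.metrics import mean_squared_error

    # Hedged toy example: a tiny regression model and an in-memory iterator,
    # not the original setup. shuffle=False keeps the batch order fixed, so
    # predictions stay aligned with the label array y.
    x = np.random.rand(100, 8, 8, 1).astype("float32")
    y = np.random.rand(100).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(8, 8, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mse", metrics=["mse"])

    datagen = tf.keras.preprocessing.image.ImageDataGenerator()
    ordered = datagen.flow(x, y, batch_size=10, shuffle=False)

    # Both numbers agree because the prediction order matches the label order.
    print(model.evaluate(ordered, verbose=0))
    print(mean_squared_error(y, model.predict(ordered, verbose=0)[:, 0]))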

Upvotes: 0

markemus

Reputation: 1814

There are a number of reasons why the loss can differ between training and evaluation.

  • Certain ops, such as batch normalization, behave differently at prediction time than during training; this can make a big difference with certain architectures, although it generally isn't supposed to if you're using batch norm correctly (see the sketch after this list).
  • MSE for training is averaged over the entire epoch, while the weights are still changing, whereas evaluation only happens on the latest "best" version of the model.
  • It could be due to differences in the datasets if the split isn't random.
  • You may be using different metrics without realizing it.
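
As a small hedged illustration of the first point (a toy layer, not the asker's model): the same inputs produce different outputs depending on the training flag, which in turn changes the loss.

    import numpy as np
    import tensorflow as tf

    # Toy illustration: BatchNormalization uses the current batch's statistics
    # when training=True, and its moving averages when training=False, so the
    # same inputs give different outputs (and therefore different losses).
    bn = tf.keras.layers.BatchNormalization()
    x = np.random.rand(4, 3).astype("float32") * 10.0

    print(bn(x, training=True))   # normalized with this batch's mean/variance
    print(bn(x, training=False))  # normalized with the (still default) moving statistics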

I'm not sure exactly what problem you're running into, but it can be caused by a lot of different things and it's often difficult to debug.

Upvotes: 1
