Reputation: 31
I am working on a regression problem with Keras and TensorFlow using a neural network. The data is split so that 282774 samples are used for training, 70694 for validation and 88367 for testing. To evaluate my models I print out the mean squared error (MSE), the mean absolute error (MAE) and the R-squared score. These are some examples of the results I get:
             MSE          MAE          R-squared
Training     1.562072899  0.958128839  0.849787137
Validation   0.687871457  0.62066941   0.935365564
Test         0.683918759  0.618674863  -16.22829222
I do not understand the R-squared value on the test data. I know that R-squared can be negative, but how can there be such a big difference between validation and test if both fall into the category of unseen data? Can someone give me a hint?
Some background information:
Since Keras does not have an R-squared metric built in, I implemented it with some code I found on the web, which seems logical to me:
from keras import backend as K

def r2_keras(y_true, y_pred):
    # R-squared = 1 - SS_res / SS_tot
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
And if it helps: this is my model:
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation, Dropout
from keras import optimizers

model = Sequential()
model.add(Dense(75, input_shape=(7,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))

adam = optimizers.Adam(lr=0.001)
model.compile(loss='mse',
              optimizer=adam,
              metrics=['mse', 'mae', r2_keras])

history = model.fit(x_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2)

score = model.evaluate(x_test, y_test, batch_size=32)
One strange thing I noticed is that not all test data seems to be considered. The console prints out the following:
86304/88367 [============================>.] - ETA: 0s-----
Maybe this leads to a miscalculation of R-squared?
I am thankful for any help/hint I can get on understanding this issue.
Update: I checked for outliers, but could not find any significant ones. Minimum and maximum values for test and training data are close, considering the standard deviation. The histograms also look very much alike.
So in the next step I let my model predict the values for the test data again and used pandas + numpy to calculate the r2_score. This time I got a value which is approximately equal to the r2_score for validation.
Below is how I did it. Do you see any flaws in the way I performed the calculation? (I just want to be sure that the old r2_score for "test" was indeed a calculation error)
# "test" is a dataframe with input data and the real outputs
# "inputs" is a list of the input column names
# The real/true outputs are contained in the column "output"
test['output_pred'] = model.predict(x=np.array(test[inputs]))
output_mean = test['output'].mean() # Is this the correct mean value for r2 here?
test['SSres'] = np.square(test['output']-test['output_pred'])
test['SStot'] = np.square(test['output']-output_mean)
r2 = 1-(test['SSres'].sum()/(test['SStot'].sum()))
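For comparison, the same calculation could also be cross-checked with scikit-learn's r2_score (a sketch assuming scikit-learn is installed, reusing the "test" DataFrame, the "inputs" list and the model from above):

import numpy as np
from sklearn.metrics import r2_score

y_true = test['output'].values
y_pred = model.predict(np.array(test[inputs])).ravel()  # flatten the (n, 1) predictions

# r2_score computes 1 - SS_res/SS_tot over the whole array at once,
# so it should agree with the pandas/numpy calculation above
print(r2_score(y_true, y_pred))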
Upvotes: 3
Views: 4488
Reputation: 11
TensorFlow's built-in evaluate method runs over your test set batch by batch and hence calculates r2 for each batch. The metric reported by model.evaluate() is then the simple average of the r2 values from all batches. In model.fit(), on the other hand, r2 (and all metrics on the validation set) is calculated per epoch (instead of per batch and then averaged).
You may slice your output and output_pred into batches of the same batch size you used in model.evaluate() and calculate r2 on each batch; a sketch of this idea follows below. I guess the model produces a high r2 on batches with a high total sum of squares (SS_tot) and a bad r2 on batches with a low one, so the average comes out poor (whereas when r2 is calculated on the entire dataset, the samples with a higher SS_tot usually dominate the result).
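Here is a minimal NumPy sketch of that comparison (not your exact pipeline; it assumes y_true and y_pred are 1-D arrays holding the true and predicted test outputs):

import numpy as np

def r2(y_true, y_pred):
    # Same formula as r2_keras, with a small epsilon to avoid division by zero
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / (ss_tot + 1e-7)

batch_size = 32

# R-squared computed once over the full test set
r2_full = r2(y_true, y_pred)

# Average of per-batch R-squared values, roughly what model.evaluate()
# reports for a custom metric
r2_batches = [r2(y_true[i:i + batch_size], y_pred[i:i + batch_size])
              for i in range(0, len(y_true), batch_size)]
r2_avg = np.mean(r2_batches)

# r2_avg can be much lower (even strongly negative) than r2_full, because
# batches with a very small SS_tot can yield extreme negative values that
# drag the average down
print(r2_full, r2_avg)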
Upvotes: 1