Reputation: 31
I am working on a regression problem with Keras and TensorFlow using a neural network. The data is split so that 282774 samples are used for training, 70694 for validation and 88367 for testing. To evaluate my models I print out the mean squared error (MSE), the mean absolute error (MAE) and the R-squared score. These are some examples of the results I get:
             MSE          MAE          R-squared
Training     1.562072899  0.958128839  0.849787137
Validation   0.687871457  0.62066941   0.935365564
Test         0.683918759  0.618674863  -16.22829222
I do not understand the R-squared value on the test data. I know that R-squared can be negative, but how can there be such a big difference between validation and test if both fall into the category of unseen data? Can someone give me a hint?
Some background information:
Since Keras does not have an R-squared metric built in, I implemented it with some code I found on the web, which seems logical to me:
from keras import backend as K

def r2_keras(y_true, y_pred):
    # R-squared = 1 - SS_res / SS_tot
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
And if it helps: this is my model:
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation, Dropout
from keras import optimizers

model = Sequential()
model.add(Dense(75, input_shape=(7,)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='linear'))

adam = optimizers.Adam(lr=0.001)
model.compile(loss='mse',
              optimizer=adam,
              metrics=['mse', 'mae', r2_keras])

history = model.fit(x_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2)

score = model.evaluate(x_test, y_test, batch_size=32)
One strange thing I noticed is that not all test data seems to be considered. The console prints out the following:
86304/88367 [============================>.] - ETA: 0s-----
Maybe this leads to a miscalculation of R-squared?
I am thankful for any help/hint I can get on understanding this issue.
Update: I checked for outliers, but could not find any significant ones. Minimum and maximum values for test and training data are close, considering the standard deviation. The histograms also look very much alike.
So in the next step I let my model predict the values for the test data again and used pandas + numpy to calculate the r2_score. This time I got a value which is approximately equal to the r2_score for validation.
Below is how I did it. Do you see any flaws in the way I performed the calculation? (I just want to be sure that the old r2_score for "test" was indeed a calculation error)
# "test" is a dataframe with input data and the real outputs
# "inputs" is a list of the input column names
# The real/true outputs are contained in the column "output"
test['output_pred'] = model.predict(x=np.array(test[inputs]))
output_mean = test['output'].mean() # Is this the correct mean value for r2 here?
test['SSres'] = np.square(test['output']-test['output_pred'])
test['SStot'] = np.square(test['output']-output_mean)
r2 = 1-(test['SSres'].sum()/(test['SStot'].sum()))
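For comparison, the same calculation could also be cross-checked with scikit-learn's r2_score (a sketch assuming scikit-learn is installed, reusing the "test" DataFrame, the "inputs" list and the model from above):

import numpy as np
from sklearn.metrics import r2_score

y_true = test['output'].values
y_pred = model.predict(np.array(test[inputs])).ravel()  # flatten the (n, 1) predictions

# r2_score computes 1 - SS_res/SS_tot over the whole array at once,
# so it should agree with the pandas/numpy calculation above
print(r2_score(y_true, y_pred))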
Upvotes: 3
Views: 4488
Reputation: 11
TensorFlow's built-in evaluate method runs over your test set batch by batch and hence calculates r2 for each batch. The metric reported by model.evaluate() is then the simple average of the r2 values from all batches. In model.fit(), on the other hand, r2 (and all metrics on the validation set) is calculated per epoch (instead of per batch and then averaged).
You may slice your output and output_pred into batches of the same batch size you used in model.evaluate() and calculate r2 on each batch; a sketch of this idea follows below. I guess the model produces a high r2 on batches with a high total sum of squares (SS_tot) and a bad r2 on batches with a low one, so the average comes out poor (whereas when r2 is calculated on the entire dataset, the samples with a higher SS_tot usually dominate the result).
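Here is a minimal NumPy sketch of that comparison (not your exact pipeline; it assumes y_true and y_pred are 1-D arrays holding the true and predicted test outputs):

import numpy as np

def r2(y_true, y_pred):
    # Same formula as r2_keras, with a small epsilon to avoid division by zero
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / (ss_tot + 1e-7)

batch_size = 32

# R-squared computed once over the full test set
r2_full = r2(y_true, y_pred)

# Average of per-batch R-squared values, roughly what model.evaluate()
# reports for a custom metric
r2_batches = [r2(y_true[i:i + batch_size], y_pred[i:i + batch_size])
              for i in range(0, len(y_true), batch_size)]
r2_avg = np.mean(r2_batches)

# r2_avg can be much lower (even strongly negative) than r2_full, because
# batches with a very small SS_tot can yield extreme negative values that
# drag the average down
print(r2_full, r2_avg)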
Upvotes: 1