123123roma

Reputation: 31

R^2 (coefficient of determination) calculated with numpy and sklearn gives different results

I need to calculate the coefficient of determination for a linear regression model.

I ran into something strange: the result I get from the definition using numpy functions differs from what sklearn.metrics.r2_score returns. This code shows the difference:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])

r2_score(y_true, y_pred)

>>> -1.6546391752577323

def my_r2_score(y_true, y_pred):
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((np.average(y_true) - y_true) ** 2)

def my_r2_score_var(y_true, y_pred):
    return 1 - np.var(y_true - y_pred) / np.var(y_true)

print(my_r2_score(y_true, y_pred))
print(my_r2_score_var(y_true, y_pred))

>>> -1.6546391752577323
>>> -0.7835051546391754

Can anybody explain this difference?

Upvotes: 3

Views: 703

Answers (1)

timgeb

Reputation: 78700

my_r2_score_var is wrong, because np.sum((y_true - y_pred) ** 2)/5 is not equal to np.var(y_true - y_pred).

>>> np.sum((y_true - y_pred) ** 2)/5
5.15
>>> np.var(y_true - y_pred)
3.46

np.var subtracts the mean of its argument before squaring, so what np.var(y_true - y_pred) actually computes is:

>>> np.sum(((y_true - y_pred) - np.average(y_true - y_pred))**2)/5
3.46
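The gap between the two numbers is the squared mean of the residuals, since the variance of any array equals the mean of its squares minus the square of its mean. Here is a minimal check of that identity with the arrays from the question; the variable names (residuals, mean_sq, sq_mean) are only illustrative:

```python
import numpy as np

y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])

residuals = y_true - y_pred

mean_sq = np.mean(residuals ** 2)   # RSS / n, the quantity R^2 needs
sq_mean = np.mean(residuals) ** 2   # squared mean residual
variance = np.var(residuals)        # population variance (ddof=0)

# var(e) == mean(e**2) - mean(e)**2
print(np.isclose(variance, mean_sq - sq_mean))
```

Because the residuals here do not average to zero, np.var removes that extra term and reports a smaller number than RSS/5.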

np.sum((y_true - y_pred) ** 2) is the correct RSS.

You assumed np.var(y_true - y_pred) to be the mean RSS (RSS/5 here), but it isn't.

np.var(y_true), on the other hand, really is the mean TSS (TSS/5 here), because the TSS is defined around the mean of y_true. So only the RSS part of your 1 - RSS/TSS formula is wrong.
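If you want to keep the "ratio of means" form, one possible fix is to replace np.var of the residuals with the plain mean of the squared residuals. This is a sketch, not the sklearn implementation; r2_score_mean_sq is a hypothetical name, and my_r2_score is copied from the question for comparison:

```python
import numpy as np

def my_r2_score(y_true, y_pred):
    # Definition from the question: 1 - RSS / TSS
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((np.average(y_true) - y_true) ** 2)

def r2_score_mean_sq(y_true, y_pred):
    # Hypothetical variant: mean RSS over mean TSS.
    # np.mean of the squared residuals does NOT subtract their mean,
    # which is the difference from np.var(y_true - y_pred).
    return 1 - np.mean((y_true - y_pred) ** 2) / np.var(y_true)

y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])

print(my_r2_score(y_true, y_pred))
print(r2_score_mean_sq(y_true, y_pred))
```

Both calls should print the same value, matching the r2_score result in the question. The np.var version only agrees with them when the residuals average to zero, as they do for an ordinary least squares fit with an intercept evaluated on its own training data.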

Upvotes: 2
