Reputation: 31
I need to calculate the coefficient of determination for a linear regression model.
I ran into something strange: the result I get from the definition, computed with numpy
functions, differs from the result of sklearn.metrics.r2_score.
This code shows the difference:
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])
r2_score(y_true, y_pred)
>>> -1.6546391752577323
def my_r2_score(y_true, y_pred):
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((np.average(y_true) - y_true) ** 2)
def my_r2_score_var(y_true, y_pred):
    return 1 - np.var(y_true - y_pred) / np.var(y_true)
print(my_r2_score(y_true, y_pred))
print(my_r2_score_var(y_true, y_pred))
>>>-1.6546391752577323
>>>-0.7835051546391754
Can anybody explain this difference?
Upvotes: 3
Views: 703
Reputation: 78700
my_r2_score_var is wrong because np.sum((y_true - y_pred) ** 2)/5 is not equal to np.var(y_true - y_pred).
>>> np.sum((y_true - y_pred) ** 2)/5
5.15
>>> np.var(y_true - y_pred)
3.46
What you are actually computing with np.var(y_true - y_pred) is:
>>> np.sum(((y_true - y_pred) - np.average(y_true - y_pred))**2)/5
3.46
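To make the gap between the two numbers concrete, here is a minimal sketch using the arrays from the question. It checks the standard identity var(r) = mean(r**2) - mean(r)**2, so the two quantities agree only when the residuals average to zero. The variable names (residuals, mean_sq, var_res, bias_sq) are just illustrative choices.

```python
import numpy as np

y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])
residuals = y_true - y_pred

mean_sq = np.mean(residuals ** 2)   # RSS / n
var_res = np.var(residuals)         # mean squared deviation from the residual mean
bias_sq = np.mean(residuals) ** 2   # squared mean residual

# Identity: var(r) = mean(r**2) - mean(r)**2
print(np.isclose(var_res, mean_sq - bias_sq))
```

Here the residuals have a mean of -1.3, so the variance is smaller than RSS/5 by 1.3**2 = 1.69.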
np.sum((y_true - y_pred) ** 2) is the correct RSS.
You assumed np.var(y_true - y_pred) to be the mean RSS (RSS/5 here), but it isn't, because np.var subtracts the mean of the residuals before squaring.
However, np.var(y_true) happens to be the mean TSS (TSS/5). So you got the RSS part of the 1 - RSS/TSS formula wrong.
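If you want to keep a "mean-based" form of the formula, one possible fix is to replace the variance of the residuals with their mean square. This is only a sketch; the function name r2_from_means is hypothetical and not part of numpy or sklearn.

```python
import numpy as np

def r2_from_means(y_true, y_pred):
    """1 - (RSS/n) / (TSS/n), which equals 1 - RSS/TSS."""
    mean_rss = np.mean((y_true - y_pred) ** 2)  # RSS / n, no mean subtraction
    mean_tss = np.var(y_true)                   # TSS / n
    return 1 - mean_rss / mean_tss

y_true = np.array([2, -0.5, 2.5, 3, 0])
y_pred = np.array([2.5, 0.0, 3, 8, 0])
result = r2_from_means(y_true, y_pred)
print(result)
```

For the arrays in the question this should agree with my_r2_score and with the r2_score value shown in the question, up to floating-point rounding.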
Upvotes: 2