Reputation: 280
I am using Leave-One-Out Cross-Validation on a linear regression model. I have 8869 observations, and I run the following:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()
list_Rs = cross_val_score(reg, X_34_const, y_34,
                          cv=len(y_34),
                          scoring='r2')
I should obtain a NumPy array of 8869 values between 0 and 1, with 8 decimal places. The problem is that, in producing the result, Python appears to automatically round all such values to 0.0:
array([0., 0., 0., ..., 0., 0., 0.])
whereas if I instead use 2-fold cross-validation (which makes list_Rs a NumPy array with 2 values), it prints the unrounded values correctly:
list_Rs = cross_val_score(reg, X_34_const, y_34,
                          cv=2,
                          scoring='r2')
which, printed, is:
array([0.16496198, 0.18115719])
This is not simply a printing-representation problem, since, for instance:
print(list_Rs[3] == 0)
returns True. This is a major problem for me, since in my computations I will then need to put the values of list_Rs in the denominator of a fraction!
How can I solve the problem so that the values in my 8869-element array are not automatically rounded?
Many thanks and I look forward to hearing from you.
Upvotes: 1
Views: 515
Reputation: 30856
Neither Python nor NumPy is doing any rounding here: scikit-learn's r2_score scoring function (which is invoked under the hood when cross_val_score is called with scoring='r2') is returning actual zeros.
That's because with leave-one-out, each validation set consists of a single sample. So for each fold of your cross-validation, r2_score is called with a single observed value along with a single predicted value for that observation, and in that situation it produces zero. For example:
>>> from sklearn.metrics import r2_score
>>> import numpy as np
>>> y_true = np.array([2.3])
>>> y_pred = np.array([2.1])
>>> r2_score(y_true, y_pred)
0.0
Here's the portion of the implementation where r2_score ends up (somewhat arbitrarily) returning zero when evaluated on a single data point, assuming the predicted value isn't an exact match for the observed value.
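For intuition, here's a simplified sketch of that logic (my own condensation, not the actual scikit-learn source, and the exact behaviour may vary across versions):

import numpy as np

def r2_score_sketch(y_true, y_pred):
    # Simplified version of the R^2 computation.
    numerator = ((y_true - y_pred) ** 2).sum()           # residual sum of squares
    denominator = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    if denominator != 0:
        return 1 - numerator / denominator
    # Degenerate case: y_true has zero variance (e.g. a single sample).
    # A perfect prediction scores 1.0; anything else is scored 0.0.
    return 1.0 if numerator == 0 else 0.0

With a single observation the denominator is always zero, so every imperfect prediction is scored 0.0, which is exactly what fills your array.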
Arguably, r2_score should either raise an exception or produce negative infinity rather than zero here: the coefficient of determination uses the variance of the observed data as a normalising factor, and when there's only a single observation that variance is zero, so the formula for the R² score involves a division by zero. There's some discussion of this in a scikit-learn bug report.
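If what you're after is a leave-one-out estimate of predictive performance, one common workaround (a sketch of my own, not something from the bug report) is to collect the held-out predictions with cross_val_predict and compute a single R² over all of them, which avoids per-fold scores entirely:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

reg = LinearRegression()
# One out-of-sample prediction per observation, using leave-one-out folds.
y_pred = cross_val_predict(reg, X_34_const, y_34, cv=len(y_34))
# A single R^2 over all 8869 held-out predictions is well defined.
print(r2_score(y_34, y_pred))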
Upvotes: 1