Nicg

Reputation: 280

Python NumPy array: unwanted automatic rounding

I am using leave-one-out cross-validation on a linear regression model with 8869 observations. As a result of the following:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

reg = LinearRegression()

# cv equal to the number of observations gives leave-one-out CV
list_Rs = cross_val_score(reg, X_34_const, y_34,
                          cv=len(y_34),
                          scoring='r2')

I should obtain a NumPy array of 8869 values between 0 and 1, shown to 8 decimal places. The problem is that, in the result, Python appears to automatically round all such values to 0.0:

array([0., 0., 0., ..., 0., 0., 0.])

whereas if, for instance, I use 2-fold cross-validation (which makes list_Rs a NumPy array with 2 values), the unrounded values are printed correctly:

list_Rs = cross_val_score(reg, X_34_const, y_34,
                          cv=2,
                          scoring='r2')

which, printed, is:

array([0.16496198, 0.18115719])

This is not simply a printing-representation problem, since, for instance:

print(list_Rs[3] == 0)

returns True. This is a major problem for me, since in my later computations I will need to use the values of list_Rs in the denominator of a fraction!

How can I solve this so that the values in my 8869-element array are not automatically rounded as well?

Many thanks and I look forward to hearing from you.

Upvotes: 1

Views: 515

Answers (1)

Mark Dickinson

Reputation: 30856

Neither Python nor NumPy is doing any rounding here: scikit-learn's r2_score scoring function (which is invoked under the hood when calling cross_val_score with scoring='r2') is returning actual zeros.

That's because with leave-one-out, each validation set consists of a single sample. So for each fold of your cross-validation, r2_score is called with a single observed value and a single predicted value for that observation, and in that situation it produces zero. For example:

>>> from sklearn.metrics import r2_score
>>> import numpy as np
>>> y_true = np.array([2.3])
>>> y_pred = np.array([2.1])
>>> r2_score(y_true, y_pred)
0.0
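As a quick check that an integer cv equal to the number of samples really does produce single-sample validation sets, here's a minimal sketch with synthetic data (the array y and the loop variables are ours; for a regressor, passing an integer cv to cross_val_score uses a KFold split):

>>> import numpy as np
>>> from sklearn.model_selection import KFold
>>> y = np.arange(6.0)
>>> X = y.reshape(-1, 1)
>>> for train_index, test_index in KFold(n_splits=len(y)).split(X):
...     print(test_index)
...
[0]
[1]
[2]
[3]
[4]
[5]

Each test fold holds exactly one index, so each per-fold R2 is computed from a single observation.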

Here's the portion of the implementation where r2_score ends up (somewhat arbitrarily) returning zero when evaluated on a single data point, assuming that the predicted value isn't an exact match for the observed value.

Arguably, r2_score should be either raising an exception or producing negative infinity rather than zero here: the coefficient of determination uses the variance of the observed data as a normalising factor, and when there's only a single observation, that variance is zero, so the formula for the R2 score involves a division by zero. There's some discussion of this in a scikit-learn bug report.
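To make that division by zero concrete, here's a minimal sketch of the usual formula R2 = 1 - SS_res / SS_tot, evaluated on the single-observation example above (the names ss_res and ss_tot are ours):

>>> import numpy as np
>>> y_true = np.array([2.3])
>>> y_pred = np.array([2.1])
>>> ss_res = ((y_true - y_pred) ** 2).sum()   # residual sum of squares, nonzero here
>>> ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # a single value has zero variance
>>> print(ss_tot)
0.0

With ss_tot exactly zero, 1 - ss_res / ss_tot is undefined, which is the degenerate case that r2_score sidesteps by returning zero.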

Upvotes: 1
