Mohammad

Reputation: 1013

sklearn cross_val_score() returns NaN values when I use "r2" as scoring

I am trying to use sklearn cross_val_score(). Following is the example I have tried:

# loocv evaluate random forest on the housing dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestRegressor(random_state=1)

# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

The code above works without any problem. But when I change scoring to 'r2', every value in scores becomes nan.

Upvotes: 0

Views: 1033

Answers (1)

Parthasarathy Subburaj

Reputation: 4264

The problem is the combination of LeaveOneOut() with r2 as the scoring function. LeaveOneOut() splits the data so that each test fold holds exactly one sample and all remaining samples are used for training. That is where it breaks: r2 is computed on the validation fold using this formula:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

The denominator becomes zero because n = 1: with only one sample to validate on, the mean y_bar equals that single value y_i, so the squared deviation from the mean is zero. R2 is therefore undefined for a single sample, which is the nan you observe. This is bound to happen whenever the number of folds equals the number of data points, as shown below:

# evaluate model: 10 samples with cv=10 leaves one sample per test fold
scores = cross_val_score(model, X[0:10], y[0:10], scoring='r2', cv=10, n_jobs=-1)
# force positive (carried over from the MAE example)
scores = absolute(scores)
# report performance
print('R2: %.3f (%.3f)' % (mean(scores), std(scores)))
R2: nan (nan)
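
The zero-denominator argument can also be checked by hand. The following is a minimal sketch of the textbook R2 formula in plain Python, not scikit-learn's implementation: the helper name r2_by_hand and the sample numbers are made up for illustration, and returning nan for a zero denominator is this sketch's own convention.

```python
import math

def r2_by_hand(y_true, y_pred):
    """Textbook R^2 = 1 - SS_res / SS_tot (hypothetical helper, not sklearn's code)."""
    y_bar = sum(y_true) / len(y_true)
    # residual sum of squares: how far the predictions are from the targets
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # total sum of squares: how far the targets are from their own mean
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    if ss_tot == 0:
        # zero denominator: R^2 is undefined (this sketch reports nan)
        return math.nan
    return 1 - ss_res / ss_tot

# one test sample, as in a leave-one-out fold: y_bar == y_i, so SS_tot == 0
single = r2_by_hand([24.0], [21.5])
# three test samples with non-constant targets: SS_tot > 0
several = r2_by_hand([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(single, several)
```

With a single test sample the denominator vanishes no matter how good the prediction is, so the result carries no information about the model.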

When I set cv to a smaller number of folds, so that each test fold holds more than one sample, it works fine:

# evaluate model: 10 samples with cv=3 leaves 3 or 4 samples per test fold
scores = cross_val_score(model, X[0:10], y[0:10], scoring='r2', cv=3, n_jobs=-1)
# force positive (carried over from the MAE example)
scores = absolute(scores)
# report performance
print('R2: %.3f (%.3f)' % (mean(scores), std(scores)))
R2: 0.662 (0.229)
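
If you want to keep LeaveOneOut(), one possible workaround is to collect the single held-out prediction from every fold with cross_val_predict and compute one r2 over all of them. This is only a sketch under assumptions: it uses synthetic data from make_regression and a LinearRegression model as fast stand-ins (the names X_demo, y_demo, pooled_pred and pooled_r2 are made up here), and a pooled r2 is a different quantity from the mean of per-fold r2 values.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# synthetic stand-in data (an assumption; swap in your own X and y)
X_demo, y_demo = make_regression(n_samples=30, n_features=4, noise=5.0, random_state=1)

# one out-of-fold prediction per sample, each from a model that never saw that sample
pooled_pred = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=LeaveOneOut())

# a single r2 over all 30 held-out predictions, so the denominator is not zero
pooled_r2 = r2_score(y_demo, pooled_pred)
print('Pooled LOOCV R2: %.3f' % pooled_r2)
```

The same pattern should work with RandomForestRegressor(random_state=1) in place of the linear model, at the cost of fitting one forest per sample.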

Upvotes: 1
