Reputation: 1013
I am trying to use scikit-learn's cross_val_score(). This is the example I tried:
# loocv evaluate random forest on the housing dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestRegressor(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
The above code works fine without any problem. But when I change scoring to 'r2', all values in scores become nan.
Upvotes: 0
Views: 1033
Reputation: 4264
The problem is using LeaveOneOut() in combination with r2 as the scoring function. LeaveOneOut() splits the data so that exactly one sample is used for testing and all remaining samples are used for training. The trouble starts when r2 is computed on that single-sample validation set using this formula:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

The denominator becomes zero because n=1 (only one sample to validate on), so y_bar = y_i: the mean equals the one value you have. The score is therefore undefined, which is the nan you observe. The same thing happens whenever cv equals the number of data points, as shown below:
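# evaluate model: 10 samples with cv=10 leaves a single sample in every test fold
scores = cross_val_score(model, X[0:10], y[0:10], scoring='r2', cv=10, n_jobs=-1)
# absolute() is kept from the MAE example; note that R^2 itself can be negative
scores = absolute(scores)
# report performance (these are R^2 scores, not MAE)
print('R2: %.3f (%.3f)' % (mean(scores), std(scores)))
R2: nan (nan)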
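The nan can also be reproduced without any cross-validation, by calling r2_score on a single sample. This is a minimal sketch; the two numbers are made-up values standing in for one leave-one-out test fold, and the behaviour described (a warning plus a nan return value) is what recent scikit-learn versions do, so older versions may differ.

```python
import math
import warnings

from sklearn.metrics import r2_score

# One true value and one prediction, as in a single leave-one-out test fold.
# Both numbers are hypothetical and chosen only for illustration.
y_true = [24.0]
y_pred = [22.5]

with warnings.catch_warnings():
    # scikit-learn warns that R^2 is not well-defined for fewer than two samples;
    # the warning is silenced here to keep the demo output short.
    warnings.simplefilter("ignore")
    single_sample_r2 = r2_score(y_true, y_pred)

print(single_sample_r2)
print(math.isnan(single_sample_r2))
```

Since every leave-one-out fold looks like this, every entry in scores ends up as nan, and so do their mean and standard deviation.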
When I set cv to a smaller value, so that each test fold holds more than one sample, it works fine:
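# evaluate model: cv=3 on 10 samples gives test folds of 3 or 4 samples each
scores = cross_val_score(model, X[0:10], y[0:10], scoring='r2', cv=3, n_jobs=-1)
# absolute() is kept from the MAE example; note that R^2 itself can be negative
scores = absolute(scores)
# report performance (these are R^2 scores, not MAE)
print('R2: %.3f (%.3f)' % (mean(scores), std(scores)))
R2: 0.662 (0.229)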
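If leave-one-out is still what you want, one possible workaround is to collect the held-out predictions with cross_val_predict and compute a single r2 over all of them, rather than one r2 per fold. The sketch below is not from the original post: it uses synthetic data from make_regression as a stand-in for the housing dataset, and the names X_demo, y_demo and pooled_r2 are only illustrative. Note that this pooled value is a different quantity from the mean of per-fold r2 scores.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in data (an assumption, not the housing dataset from the question).
X_demo, y_demo = make_regression(n_samples=40, n_features=5, noise=10.0, random_state=1)

# A small forest keeps the 40 leave-one-out fits quick.
model = RandomForestRegressor(n_estimators=25, random_state=1)

# Collect one out-of-fold prediction per sample under leave-one-out.
y_pred = cross_val_predict(model, X_demo, y_demo, cv=LeaveOneOut())

# Score all held-out predictions together, so the denominator spans many samples.
pooled_r2 = r2_score(y_demo, y_pred)
print('Pooled LOOCV R2: %.3f' % pooled_r2)
```

Because the denominator is now computed over all 40 targets instead of one, it is non-zero as long as the targets are not all identical.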
Upvotes: 1