Reputation: 2017
I am using sklearn's GridSearchCV to get the best parameters for my Random Forest model.
Below is my code:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

model = RandomForestRegressor(random_state=1, n_jobs=-1)
param_grid = {"n_estimators": [5, 10]}

for parameter, param_range in param_grid.items():
    #get_optimum_range(parameter, param_range, RFReg, index)
    grid_search = GridSearchCV(estimator=model, param_grid={parameter: param_range})
    grid_search.fit(X_train, y_train)
    results = pd.DataFrame(grid_search.cv_results_)
My results dataframe is as below
If you observe, my mean_test_score is negative but my mean_train_score is positive. What could be the reason for this?
My dataframe sizes:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(538, 3)
(538,)
(112, 3)
(112,)
Upvotes: 1
Views: 6557
Reputation: 4639
Apart from the fact that R^2 can be negative (detailed in the other answer), it's worth noting that sklearn's scoring API follows the convention that higher return values are always better. For metrics where lower is better (such as mean squared error), the sign is therefore flipped, which is why scorers like neg_mean_squared_error report negative values, as explained in https://stackoverflow.com/a/27323356/6917627.
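A minimal sketch of that convention, using a synthetic dataset for illustration: with scoring="neg_mean_squared_error", every score comes back negated so that "higher is better" still holds.

```python
# Illustrating sklearn's sign convention: 'neg_mean_squared_error' is
# MSE with the sign flipped, so all returned scores are <= 0.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=1)
model = RandomForestRegressor(n_estimators=10, random_state=1)

neg_mse = cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error")
print(neg_mse)  # all values are negative; the actual MSE is -neg_mse
```

Note that this sign flip does not apply to the question's situation: the default R^2 scorer is not negated, so a negative mean_test_score there is a genuinely negative R^2.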
Upvotes: 0
Reputation: 172
In GridSearchCV, if you don't specify a scorer, the default scorer of the estimator (here RandomForestRegressor) is used. For RandomForestRegressor the default score is the R square score, also called the coefficient of determination. From the docs:
Returns the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
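A quick numeric check of the quoted definition (toy numbers, just for illustration): computing 1 - u/v by hand matches sklearn.metrics.r2_score.

```python
# Compute R^2 manually as 1 - u/v and compare with sklearn's r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
manual_r2 = 1 - u / v

assert np.isclose(manual_r2, r2_score(y_true, y_pred))
```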
R square is basically the percentage of variance explained by your model.
You can also see it as how much better your regression is compared to a trivial model that always predicts the same value (the mean), i.e. a horizontal line in 2D.
If your R square is negative, your model is doing worse than that horizontal line, which means it doesn't fit your data well.
In your case your train R^2 is pretty good, so that means either that you managed to overfit your data (though that's unlikely), or simply that the test data is not similar to the train data.
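One way to look into this is to compare mean_train_score and mean_test_score side by side in cv_results_; a large gap between them points at overfitting. A sketch on a synthetic dataset (note that recent sklearn versions need return_train_score=True for the train columns to appear):

```python
# Compare per-parameter train vs. cross-validation scores to spot overfitting.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=1)

grid = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [5, 10]},
    return_train_score=True,  # required for mean_train_score in cv_results_
)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
print(results[["param_n_estimators", "mean_train_score", "mean_test_score"]])
```

If mean_train_score is close to 1.0 while mean_test_score stays low (or negative) on your own data, the model is memorising the training folds rather than generalising.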
Upvotes: 2