Rookie_123

Reputation: 2017

GridSearchCV Negative Score

I am using sklearn's GridSearchCV to find the best parameters for my random forest model.

Below is my code:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

model = RandomForestRegressor(random_state=1, n_jobs=-1)
param_grid = {"n_estimators": [5, 10]}

for parameter, param_range in param_grid.items():
    grid_search = GridSearchCV(estimator=model, param_grid={parameter: param_range})
    grid_search.fit(X_train, y_train)
    results = pd.DataFrame(grid_search.cv_results_)

My results dataframe is as below:

(screenshot of the cv_results_ dataframe)

Note that mean_test_score is negative, while mean_train_score is positive.

What could be the reason for this?

My dataframe sizes:

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(538, 3)
(538,)
(112, 3)
(112,)

Upvotes: 1

Views: 6557

Answers (2)

Lorenz Walthert

Reputation: 4639

Apart from the fact that R^2 can be negative (detailed in the other answer), it's worth noting that sklearn's unified scoring API always maximises the score, so metrics that should be minimised (such as mean squared error) are negated before being reported, as explained in https://stackoverflow.com/a/27323356/6917627.
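A quick sketch of that behaviour (the toy data and parameter grid below are made up purely for illustration): when you ask for a loss metric, you use the negated scorer name, and the reported scores come out negative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 2 * X[:, 0] + rng.rand(100)

# MSE is a loss (lower is better), so sklearn exposes it as
# "neg_mean_squared_error" and reports the negated value,
# keeping the "higher score is better" convention intact.
grid = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [5, 10]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_score_)  # negative, since it is -MSE of the best model
```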

Upvotes: 0

user1672455

Reputation: 172

In GridSearchCV, if you don't specify a scorer, the default scorer of the estimator (here RandomForestRegressor) is used. For RandomForestRegressor the default score is the R square score, also called the coefficient of determination. From the docs:

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
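The quoted definition can be checked directly against sklearn's r2_score (the numbers below are an arbitrary made-up example):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares

print(1 - u / v)                  # matches r2_score(y_true, y_pred)
print(r2_score(y_true, y_pred))
```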

R square is essentially the fraction of variance explained by your model.
You can also see it as a measure of how much better your regression is than a trivial model that always predicts the same value, the mean (so a horizontal line in 2D).

If your R square is negative, your model is worse than that simple horizontal line, i.e. your model does not fit the data well.
In your case the train R^2 is quite good, so either you managed to overfit your training data, or the test data is simply not similar to the train data.
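A tiny illustration of both points (with made-up numbers): predicting the mean yields R^2 = 0, and any predictions worse than the mean push R^2 below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# A constant model that always predicts the mean gets R^2 = 0.0 ...
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))

# ... while a model worse than the mean (here: predictions reversed)
# gets a negative R^2.
print(r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0])))
```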

Upvotes: 2
