Any help with the question below will be deeply appreciated. X is the input descriptor, of shape (10000, 72), and Y is the output label, a column vector. A random-forest model is used. To keep the case simple, the grid search is over a single parameter value and only one cross-validation split is performed. Before fitting the model at the end, the training and test (more accurately, validation) data points are collected.
from sklearn import pipeline
from sklearn import model_selection as modsel
from sklearn.ensemble import RandomForestRegressor

param_grid = {'randomforestregressor__min_samples_split': [5]}
clf = pipeline.make_pipeline(RandomForestRegressor(random_state=1))
cv = modsel.ShuffleSplit(n_splits=1, test_size=0.5, random_state=1)
gs = modsel.GridSearchCV(clf, cv=cv, param_grid=param_grid, scoring='r2',
                         return_train_score=True, verbose=False)

# Keep a copy of the single train/validation split generated by cv.
for train_index, test_index in cv.split(X):
    Xtrain = X[train_index]; Ytrain = Y[train_index]
    Xtest = X[test_index]; Ytest = Y[test_index]

gs.fit(X, Y)
print(gs.cv_results_)
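For reference, the two summary scores discussed below can also be read out of cv_results_ by key (a minimal sketch; mean_train_score is only present because return_train_score=True was set):

print(gs.cv_results_['mean_train_score'])  # mean R^2 on the training folds
print(gs.cv_results_['mean_test_score'])   # mean R^2 on the validation folds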
From cv_results_, the mean_train_score is 0.85863713 and the mean_test_score (which is really a validation score) is 0.41913632. The trained model is then applied to Xtrain and Xtest:
predictedYtrain=gs.best_estimator_.predict(Xtrain)
predictedYtest=gs.best_estimator_.predict(Xtest)
From a linear plot of predictedYtrain vs. Ytrain and of predictedYtest vs. Ytest, I observe R^2 to be around 0.9 in both cases. How is this possible? I was expecting roughly 0.85 and 0.42. Can someone please explain where the discrepancy is coming from?
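For reference, the same comparison can be made numerically with sklearn.metrics.r2_score (a minimal sketch; the plot-based R^2 in the question may instead be the squared correlation of a fitted line, which is not always identical to the coefficient of determination):

from sklearn.metrics import r2_score
print(r2_score(Ytrain, predictedYtrain))  # coefficient of determination on the training half
print(r2_score(Ytest, predictedYtest))    # coefficient of determination on the validation half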
You are not controlling the random state of the ShuffleSplit object, so you are likely to get a different split each time. From the example you've posted it's not clear whether the Python interpreter is restarted between the training phase and the test phase, but the fact that you are pickling makes me believe it is.
Try controlling the random state of your model:
cv = modsel.ShuffleSplit(n_splits=1, test_size=0.5, random_state=1)
or adjust the script so that it runs in one go, without stopping the interpreter.
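A minimal sketch of that suggestion, with every random state fixed and the split, the grid search, and the evaluation done in a single run (variable names follow your question; X and Y are assumed to already be in memory, and nothing is pickled or reloaded):

from sklearn import pipeline
from sklearn import model_selection as modsel
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

clf = pipeline.make_pipeline(RandomForestRegressor(random_state=1))
cv = modsel.ShuffleSplit(n_splits=1, test_size=0.5, random_state=1)
gs = modsel.GridSearchCV(clf, cv=cv,
                         param_grid={'randomforestregressor__min_samples_split': [5]},
                         scoring='r2', return_train_score=True)

# With random_state fixed, cv.split(X) yields the same indices every time it
# is called, so these are the same indices GridSearchCV uses internally.
train_index, test_index = next(cv.split(X))
gs.fit(X, Y)

print(gs.cv_results_['mean_train_score'], gs.cv_results_['mean_test_score'])
print(r2_score(Y[train_index], gs.best_estimator_.predict(X[train_index])))
print(r2_score(Y[test_index], gs.best_estimator_.predict(X[test_index])))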