Any help with the question below will be deeply appreciated. X is the input descriptor, of shape (10000, 72), and Y is the output label, a column vector. A random-forest model is used. To keep the case simple, the grid search is over a single parameter value and only one cross-validation split is performed. Before fitting the model at the end, the training and test (more accurately, validation) data points are collected.
from sklearn import pipeline
from sklearn import model_selection as modsel
from sklearn.ensemble import RandomForestRegressor

param_grid = {'randomforestregressor__min_samples_split': [5]}
clf = pipeline.make_pipeline(RandomForestRegressor(random_state=1))
cv = modsel.ShuffleSplit(n_splits=1, test_size=0.5, random_state=1)
gs = modsel.GridSearchCV(clf, cv=cv, param_grid=param_grid, scoring='r2',
                         return_train_score=True, verbose=False)

# Keep a copy of the single train/validation split generated by cv.
for train_index, test_index in cv.split(X):
    Xtrain = X[train_index]; Ytrain = Y[train_index]
    Xtest = X[test_index]; Ytest = Y[test_index]

gs.fit(X, Y)
print(gs.cv_results_)
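For reference, the two summary scores discussed below can also be read out of cv_results_ by key (a minimal sketch; mean_train_score is only present because return_train_score=True was set):

print(gs.cv_results_['mean_train_score'])  # mean R^2 on the training folds
print(gs.cv_results_['mean_test_score'])   # mean R^2 on the validation folds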
From cv_results_, the mean_train_score is 0.85863713 and the mean_test_score (which is really a validation score) is 0.41913632. The trained model is then applied to Xtrain and Xtest:
predictedYtrain=gs.best_estimator_.predict(Xtrain)
predictedYtest=gs.best_estimator_.predict(Xtest)
From a linear plot of predictedYtrain vs. Ytrain and of predictedYtest vs. Ytest, I observe R^2 to be around 0.9 in both cases. How is this possible? I was expecting roughly 0.85 and 0.42. Can someone please explain where the discrepancy is coming from?
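For reference, the same comparison can be made numerically with sklearn.metrics.r2_score (a minimal sketch; the plot-based R^2 in the question may instead be the squared correlation of a fitted line, which is not always identical to the coefficient of determination):

from sklearn.metrics import r2_score
print(r2_score(Ytrain, predictedYtrain))  # coefficient of determination on the training half
print(r2_score(Ytest, predictedYtest))    # coefficient of determination on the validation half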
You are not controlling the random state of the ShuffleSplit object, so you are likely to get a different split each time. From the example you've posted it's not clear whether the Python interpreter is restarted between the training phase and the test phase, but the fact that you are pickling makes me believe it is.
Try controlling the random state of your model:
cv = modsel.ShuffleSplit(n_splits=1, test_size=0.5, random_state=1)
or adjust the script so that it runs in one go, without stopping the interpreter.
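A minimal sketch of that suggestion, with every random state fixed and the split, the grid search, and the evaluation done in a single run (variable names follow your question; X and Y are assumed to already be in memory, and nothing is pickled or reloaded):

from sklearn import pipeline
from sklearn import model_selection as modsel
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

clf = pipeline.make_pipeline(RandomForestRegressor(random_state=1))
cv = modsel.ShuffleSplit(n_splits=1, test_size=0.5, random_state=1)
gs = modsel.GridSearchCV(clf, cv=cv,
                         param_grid={'randomforestregressor__min_samples_split': [5]},
                         scoring='r2', return_train_score=True)

# With random_state fixed, cv.split(X) yields the same indices every time it
# is called, so these are the same indices GridSearchCV uses internally.
train_index, test_index = next(cv.split(X))
gs.fit(X, Y)

print(gs.cv_results_['mean_train_score'], gs.cv_results_['mean_test_score'])
print(r2_score(Y[train_index], gs.best_estimator_.predict(X[train_index])))
print(r2_score(Y[test_index], gs.best_estimator_.predict(X[test_index])))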