Reputation: 41
It is not clear to me how GridSearchCV splits the data into training and test sets.
I have 67,959 labeled rows in total.
By default, train_test_split splits them into 75% training and 25% test, which gives 50,969 training rows and 16,990 test rows.
When I print the length of the y_pred array inside my T_scorer function, it comes out as 5662.
Along the way I also print the confusion matrix.
If I add up all the elements, I get about 16,990.
It looks as though the test set is being split into training and test sets again.
What am I doing wrong?
I need the test set to stay at 16,990 rows and the training set at 50,969.
The printed confusion matrices and lengths:

[[ 763  891]
 [1216 2792]]
5662
[[2785  525]
 [1578 6440]]
11328

My code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# T_scorer and parameter_grid are defined elsewhere in my script.
_scorer = make_scorer(T_scorer)
# Default split: 75% training (50969 rows), 25% test (16990 rows).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier()
grid_searcher = GridSearchCV(clf, parameter_grid, verbose=20, scoring=_scorer)
grid_searcher.fit(X_test, y_test)
clf_best = grid_searcher.best_estimator_
print('Best params = ', clf_best.get_params())
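
For reference, the split sizes themselves check out; a quick sketch with dummy arrays of the same length (assuming the default test_size of 0.25):

import numpy as np
from sklearn.model_selection import train_test_split

X_dummy = np.zeros((67959, 1))
y_dummy = np.zeros(67959)
X_tr, X_te, y_tr, y_te = train_test_split(X_dummy, y_dummy, random_state=42)
print(len(X_tr), len(X_te))  # 50969 16990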
Upvotes: 0
Views: 346
Reputation: 3591
By default, GridSearchCV does 3-fold cross-validation: it splits the data you give it into 3 equal parts (1, 2, 3) and runs the following sequence: train on parts (1, 2) and validate on part 3, then train on (1, 3) and validate on 2, then train on (2, 3) and validate on 1. The score for each parameter combination is the average over those folds.
You don't need to apply another train/test split here: just provide X_train and y_train to GridSearchCV and let it do the cross-validation for you.
You can also look at the cv parameter in the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
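
To make the fold sizes concrete, here is a minimal sketch with dummy data and hypothetical balanced labels (assuming the default of 3 folds): passing your 16990-row test set to fit() means each candidate is trained on roughly two thirds of it and scored on the remaining third, which lines up with the ~11328 and ~5662 you printed.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X_dummy = np.zeros((16990, 1))     # same size as your test set
y_dummy = np.array([0, 1] * 8495)  # hypothetical balanced labels

for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X_dummy, y_dummy):
    print(len(train_idx), len(val_idx))  # ~11327 train, ~5663 validation per fold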
EDIT: here is the final code from the comments:
grid_searcher = GridSearchCV(clf, param_grid=parameter_grid, cv=StratifiedKFold(shuffle=True, random_state=42))
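
For completeness, a sketch of the full recommended flow (assuming X, y, parameter_grid and T_scorer are defined as in the question): fit the grid search on the training data only, and keep the 16990-row test set for a single final evaluation of the best estimator.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier()
grid_searcher = GridSearchCV(
    clf,
    param_grid=parameter_grid,
    scoring=make_scorer(T_scorer),
    cv=StratifiedKFold(shuffle=True, random_state=42),
)

# Cross-validation happens only inside the 50969 training rows.
grid_searcher.fit(X_train, y_train)

# The 16990 test rows are used exactly once, for the final check.
print(grid_searcher.best_estimator_.score(X_test, y_test))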
Upvotes: 1