user287629

Reputation: 41

It is not clear how the function "GridSearchCV" splits the training and test sets

It is not clear how the function "GridSearchCV" splits the data into training and test sets.
There are 67959 rows with features in total. By default, the function "train_test_split" puts 75% of the rows in the training set and 25% in the test set.
That gives 50969 training rows and 16990 test rows. When I print the length of the array y_pred inside the function "T_scorer", I get 5662. I also print the confusion matrix there. If you add up all the elements of the two matrices, you get 16990 in total. So it looks like the test set is being divided once again into training and test sets.
What am I doing wrong? I need the test set to have 16990 rows and the training set 50969.

[[ 763  891]
 [1216 2792]]
5662
[[2785  525]
 [1578 6440]]
11328

The confusion matrix values are shown above.

_scorer = make_scorer(T_scorer)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier()
grid_searcher = GridSearchCV(clf, parameter_grid, verbose=20, scoring=_scorer)
grid_searcher.fit(X_test, y_test)
clf_best = grid_searcher.best_estimator_
print('Best params = ', clf_best.get_params())

Upvotes: 0

Views: 346

Answers (1)

CoMartel

Reputation: 3591

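By default, GridSearchCV does 3-fold cross-validation (this was the default in older scikit-learn releases; since version 0.22 the default is 5-fold), meaning it splits the data you pass to fit() into 3 roughly equal parts (1, 2, 3) and runs the following sequence: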

  • train on 1,2 --> test on 3
  • train on 2,3 --> test on 1
  • train on 1,3 --> test on 2
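The fold sizes can be checked directly. The snippet below is only a rough sketch: it uses a placeholder array with 16990 rows (the size of the set passed to fit() in the question) and a plain KFold. For a classifier, GridSearchCV with an integer cv actually uses StratifiedKFold, so the exact fold sizes can differ by a few rows, as with the 5662 seen in the question.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data (assumption): only the number of rows matters here.
X = np.zeros((16990, 1))

fold_sizes = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    # Each iteration trains on two parts and tests on the remaining one.
    fold_sizes.append((len(train_idx), len(test_idx)))
    print("train:", len(train_idx), "test:", len(test_idx))
```

Each test fold holds about a third of the rows, which is why the scorer sees roughly 5662 predictions rather than 16990.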

You don't have to do another train-test split for the grid search: just pass X_train, y_train to GridSearchCV and let it do the cross-validation splits itself.
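A minimal sketch of that workflow, under some assumptions: the data is synthetic (make_classification stands in for the real features), the parameter grid is a made-up example, and accuracy replaces the custom T_scorer from the question. The grid search is fitted on the training set only, and the held-out test set is used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

# Default split: 75% training, 25% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hypothetical small grid, for illustration only.
parameter_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

# Cross-validation happens inside the training set only.
grid_searcher = GridSearchCV(
    RandomForestClassifier(random_state=42), parameter_grid, cv=3
)
grid_searcher.fit(X_train, y_train)

# The untouched test set is used once, for the final evaluation.
y_pred = grid_searcher.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(len(X_train), len(X_test), len(y_pred))
print(grid_searcher.best_params_, test_accuracy)
```

With this layout the final predictions cover the whole test set, while the smaller arrays seen inside the scorer belong to the internal validation folds.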

You can also look at the "cv" parameter in the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

EDIT: here is the final code from the comments:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
grid_searcher = GridSearchCV(clf, param_grid=parameter_grid, cv=StratifiedKFold(shuffle=True, random_state=42))

Upvotes: 1
