Reputation: 3046
I have a question regarding GridSearchCV:
By using this:
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")
I specify that k-fold cross-validation should be used with 6 folds, right? So that means my corpus is split into a training set and a test set 6 times.
Doesn't that mean that for GridSearchCV I need to use my entire corpus, like so:
gs_clf = gs_clf.fit(corpus.data, corpus.target)
And if so, how would I then get my training set from there to use for the predict method?
predictions = gs_clf.predict(??)
I have seen code where the corpus is split into a test set and a training set using train_test_split, and then X_train and Y_train are passed to gs_clf.fit.
But that doesn't make sense to me: if I split the corpus beforehand, why use cross-validation again in GridSearchCV?
Thanks for some clarification!!
Upvotes: 8
Views: 12370
Reputation: 186
Cross-validation and a held-out test split are different ways to measure an algorithm's accuracy. Cross-validation does what you have said, so you must give all the data to the classifier; splitting the data yourself when using cross-validation simply makes no sense.
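For instance, here is a minimal sketch (make_classification generates toy data as a stand-in for your corpus.data / corpus.target) showing that cross-validation performs the train/test splitting internally:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.svm import LinearSVC
>>> X, y = make_classification(n_samples=120, random_state=0)  # toy stand-in for your corpus
>>> scores = cross_val_score(LinearSVC(), X, y, cv=6, scoring="f1")  # 6 internal train/test splits
>>> scores.mean()  # average f1 over the 6 folds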
If you want to measure precision or recall using GridSearchCV, you must create a scorer and assign it to the scoring parameter of GridSearchCV, like in this example:
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
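You would then fit the grid on all of your data and inspect the results. A brief usage sketch (X and y are placeholders for your full corpus, e.g. corpus.data and corpus.target):
>>> grid.fit(X, y)           # cross-validation happens inside fit
>>> grid.best_params_        # hyper-parameters with the best mean F2 score
>>> grid.best_score_         # the corresponding cross-validated score
Because refit=True by default, grid can afterwards be used directly for predictions via grid.predict.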
Upvotes: -1
Reputation: 1724
GridSearchCV is not designed for measuring the performance of your model, but for optimizing the hyper-parameters of the classifier while training. When you call gs_clf.fit, you are actually trying different models on your entire data (across different folds) in pursuit of the best hyper-parameters. For example, if you have n different values of C and m different values of gamma for an SVM model, then you have n × m candidate models, and you are searching (grid search) through them to see which one works best on your data.

Once the search has found the best hyper-parameters (available as gs_clf.best_params_), you can use your test data to get the actual performance (e.g., accuracy, precision, ...) of your model.

So you should split your corpus into corpus.train and corpus.test, and reserve corpus.test only for the last round, when you are done with training and only want to test the final model.

As we all know, any use of test data in the process of training the model (where training data should be used) or tuning the hyper-parameters (where validation data should be used) is considered cheating and results in unrealistic performance.
Upvotes: 18