Jesmar Scicluna

Reputation: 23

Should I perform Cross Validation first and then do grid search?

I am new to machine learning. My question is the following: I have built a model and I am trying to optimize it. From some research I found that cross-validation can be used to help me avoid overfitting the model. Moreover, GridSearchCV can be used to tune the model's parameters and eventually identify the best possible ones.

Now my question is: should I do cross-validation first and then use grid search to identify the best parameters, or is using GridSearchCV enough, given that it performs cross-validation itself?

Upvotes: 2

Views: 2044

Answers (3)

Harshwardhan Nandedkar

Reputation: 253

As suggested by @Noki, you can use the cv parameter in GridSearchCV:

GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated',
             refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs',
             error_score=nan, return_train_score=False)

Also, the documentation clearly states that if it is a classification problem, the folds will automatically be stratified:

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

However, there is something I would like to add: you can make your number of folds dynamic with respect to the value counts of your Y_target variable. With stratified splitting, the number of folds cannot exceed the number of members in the least frequent class, so if any class occurs only once it will throw an error while training. I have happened to face this. Use the code snippet below to help you with that.

For example

import pandas as pd

Y_target = pd.Series([0, 1, 1, 1, 1, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6, 6])

# value_counts() sorts in descending order, so iloc[-1] is the
# frequency of the rarest class.
if Y_target.value_counts().iloc[-1] < 2:
    raise ValueError("No value can have a frequency count of 1 in Y_target")
else:
    # Cap the number of folds at the rarest class's count (here: 4),
    # so every stratified fold contains at least one member of each class.
    Kfold = Y_target.value_counts().iloc[-1]

You can then assign Kfold to the cv parameter in GridSearchCV.
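Continuing the snippet above, a minimal sketch of wiring Kfold into a grid search; the RandomForestClassifier, the param_grid values, and the one-column feature matrix X are illustrative assumptions, not part of the original answer:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical features, one row per label in Y_target above.
X = [[i] for i in range(len(Y_target))]

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3]}

# With a classifier and an integer cv, GridSearchCV stratifies the folds,
# so cv=Kfold never exceeds the rarest class's count.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=Kfold)
search.fit(X, Y_target)
print(search.best_params_)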

Upvotes: 2

felice

Reputation: 1363

See Cross validation with test data set.

My recommendation, if your dataset is large enough (a sketch follows the list):

  1. Split your dataset into training and test subsets.
  2. Perform a GridSearchCV on your training dataset.
  3. Evaluate your best model (from the GridSearchCV) on your test subset.
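A minimal sketch of that workflow; the RandomForestClassifier, the param_grid values, and scikit-learn's built-in iris data are illustrative assumptions, not part of the original answer:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)

# 1. Split into training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2. Grid search (with internal cross-validation) on the training subset only.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# 3. Evaluate the refit best model on the held-out test subset.
print(search.best_params_)
print(search.score(X_test, y_test))

Keeping the test subset out of the grid search ensures the final score is an unbiased estimate; the cross-validation scores alone are optimistic because the parameters were chosen to maximize them.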

Upvotes: 3

Noki

Reputation: 943

Now my question is: should I do cross-validation first and then use grid search to identify the best parameters, or is using GridSearchCV enough, given that it performs cross-validation itself?

The second. GridSearchCV uses a cross-validation splitting strategy to select the best parameters. If you read the scikit-learn documentation, there is a parameter called cv, which defaults to 5-fold cross-validation. If you need another cross-validation strategy, you can pass an int, a cross-validation generator, or an iterable.
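For example, a minimal sketch of overriding the default with an explicit generator; the LogisticRegression estimator, the grid values, and the iris data are illustrative assumptions, not from the original answer:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

# cv=None would mean 5-fold (stratified, since this is a classifier).
# Here we pass an explicit 10-fold shuffled stratified generator instead.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=cv)
search.fit(X, y)
print(search.best_params_)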

Upvotes: 1
