DanielTheRocketMan

Reputation: 3249

Cross Validation in Sklearn using a Custom CV

I am dealing with a binary classification problem.

I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).

However, I am not able to find the correct way to pass this to sklearn's GridSearchCV.

The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=custom_cv,
    n_jobs=-1, verbose=0, scoring=errorType)

So, my question is how to create custom_cv based on these indexes to be used in this method?

X and y are, respectively, the feature matrix and the vector of labels.

Example: Suppose that I only have one hyperparameter alpha that belongs to the set {1, 2, 3}. I would like to set alpha=1, estimate the parameters of the model (for instance, the coefficients of a regression) using the samples associated with listTrain, and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3. Then I choose the alpha that minimizes the error.
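(For concreteness, the hold-out search I describe could be written by hand like this. The toy data, index split, and the choice of Ridge as an estimator with an alpha hyperparameter are just illustrative stand-ins for my actual X, y, listTrain, and listTest.)

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Toy stand-ins for X, y, listTrain, listTest
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
listTrain = np.arange(0, 70)   # indices used to fit the model
listTest = np.arange(70, 30 + 70)  # indices used to evaluate the error

best_alpha, best_err = None, np.inf
for alpha in [1, 2, 3]:
    # Fit on the train indices only, evaluate on the held-out indices
    model = Ridge(alpha=alpha).fit(X[listTrain], y[listTrain])
    err = mean_squared_error(y[listTest], model.predict(X[listTest]))
    if err < best_err:
        best_alpha, best_err = alpha, err
```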

Upvotes: 0

Views: 1546

Answers (1)

JimmyOnThePage

Reputation: 965

EDIT: Actual answer to the question. Try passing the cv argument a generator of the indices:

def index_gen(listTrain, listTest):
    # Yield a single (train, test) split of row indices
    yield listTrain, listTest

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
    cv=index_gen(listTrain, listTest), n_jobs=-1,
    verbose=0, scoring=errorType)
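Note that a generator is exhausted after one pass, so if you need to reuse the splitter, you can equivalently pass a plain list containing one (train, test) tuple, since cv accepts any iterable of index pairs. A runnable sketch (make_classification and the index ranges are just placeholder data; substitute your own X, y, listTrain, listTest, model, and param_grid):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder data and split
X, y = make_classification(n_samples=100, random_state=0)
listTrain = np.arange(0, 70)
listTest = np.arange(70, 100)

# A single (train, test) split as a list of tuples — reusable, unlike a generator
custom_cv = [(listTrain, listTest)]

grid_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=custom_cv,
    scoring="accuracy",
)
grid_search.fit(X, y)  # rows are selected via the index pairs in custom_cv
```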

EDIT: Before Edits:

As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
    n_jobs=-1, verbose=0, scoring=errorType)

grid_search.fit(x[listTrain], y[listTrain])

Now, your training set will be split into 5 folds (you can choose the number here); the model is trained on 4 of those folds with a specific set of hyperparameters and tested on the fold that was left out. This is repeated 5 times, until every training example has been part of a left-out fold. The whole procedure is repeated for each hyperparameter setting you are testing (5 x 3 fits in this case).

grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:

clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain], 
    y[listTrain])

Finally, your classifier is tested on the test set, giving an unbiased estimate of the generalisation performance:

predictions = clf.predict(x[listTest])
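The steps above fit together end to end as follows (a self-contained sketch; make_classification, the index ranges, and the C grid are placeholder choices standing in for the asker's actual data and param_grid):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Placeholder data and train/test index partition
X, y = make_classification(n_samples=120, random_state=0)
listTrain, listTest = np.arange(0, 80), np.arange(80, 120)

# 1) Tune hyperparameters with 5-fold CV on the training rows only
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
).fit(X[listTrain], y[listTrain])

# 2) Refit the final classifier on the full training set with the best params
clf = LogisticRegression(max_iter=1000, **grid_search.best_params_).fit(
    X[listTrain], y[listTrain])

# 3) Only now touch the test set, for an unbiased performance estimate
predictions = clf.predict(X[listTest])
print(accuracy_score(y[listTest], predictions))
```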

Upvotes: 1
