Reputation: 33
I am trying to implement a Random Forest classifier using both StratifiedKFold and RandomizedSearchCV. I can see that the "cv" parameter of RandomizedSearchCV is used to do the cross-validation, but I do not understand how this is possible. I need to have the X_train, X_test, y_train, y_test data sets, and if I implement my code the way I have seen it done, it is not possible to have the four sets... I have seen things like the following:
cross_val = StratifiedKFold(n_splits=split_number)
clf = RandomForestClassifier()
n_iter_search = 45
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search,
                                   scoring=Fscorer, cv=cross_val,
                                   n_jobs=-1)
random_search.fit(X, y)
But the thing is that I need to fit my model with the X_train and y_train sets and then predict on both X_train and X_test, so I can compare the results on the training data and on the testing data to evaluate possible overfitting... This is a piece of my code; I know I am doing the work twice, but I don't know how to use StratifiedKFold and RandomizedSearchCV together correctly:
...
cross_val = StratifiedKFold(n_splits=split_number)
index_iterator = cross_val.split(features_dataframe, classes_dataframe)
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                n_iter=100, cv=cross_val,
                                verbose=2, random_state=42, n_jobs=-1)
for train_index, test_index in index_iterator:
    X_train, X_test = np.array(features_dataframe)[train_index], np.array(features_dataframe)[test_index]
    y_train, y_test = np.array(classes_dataframe)[train_index], np.array(classes_dataframe)[test_index]
    clf_random.fit(X_train, y_train)
    clf_list.append(clf_random)

    y_train_pred = clf_random.predict(X_train)
    train_accuracy = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
    train_accuracy_list.append(train_accuracy)

    y_test_pred = clf_random.predict(X_test)
    test_accuracy = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100

    confusion_matrix = pd.crosstab(y_test.ravel(), y_test_pred.ravel(),
                                   rownames=['Actual Cultives'],
                                   colnames=['Predicted Cultives'])
...
As you can see, I am doing the work of the stratified K-fold twice (or that is what I think I am doing...) only to be able to get the four data sets I need to evaluate my system. Thank you in advance for your help.
Upvotes: 3
Views: 5790
Reputation:
params = {
    'n_estimators': [200, 500],
    # note: 'auto' is deprecated in newer scikit-learn versions;
    # for classifiers it is equivalent to 'sqrt'
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
Upvotes: 1
Reputation:
cross_val = StratifiedKFold(n_splits=5)
clf = RandomForestClassifier()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=params,
                                n_iter=100, cv=cross_val, verbose=2,
                                random_state=42, n_jobs=-1, scoring='roc_auc')
clf_random.fit(X, y)  # the stratified splitting happens internally via cv
Upvotes: 0
Reputation: 1433
RandomizedSearchCV is used to find the best parameters for a classifier. It chooses random parameter combinations and fits your model with each of them. It then needs to evaluate each candidate model, and you choose the evaluation strategy via the cv parameter. Then it moves on to the next combination. You don't need to do the splitting twice. You can just write:
cross_val = StratifiedKFold(n_splits=split_number)
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                n_iter=100, cv=cross_val,
                                verbose=2, random_state=42, n_jobs=-1)
clf_random.fit(X, y)  # no manual cross_val.split(...) needed
And all of it will be done automatically. You should look at attributes like cv_results_ or best_estimator_ afterwards. If you don't want to search for the best classifier parameters, you shouldn't use RandomizedSearchCV; that is exactly what it is for.
And here is a good example.
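If you still want separate train/test accuracy to check for overfitting, one common pattern (a minimal sketch, not the asker's exact pipeline; the data here is synthetic) is to hold out a test set once with train_test_split, let RandomizedSearchCV do the stratified cross-validation internally on the training portion only, and then score the refitted best estimator on both sets:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     train_test_split)

# Synthetic stand-in for features_dataframe / classes_dataframe
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# One outer split, made once, for the final overfitting check
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

param_dist = {'n_estimators': [20, 50], 'max_depth': [3, 5, None]}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=4,
    cv=StratifiedKFold(n_splits=3),   # inner stratified CV, done internally
    random_state=42, n_jobs=-1)

# Fit on the training portion only; with refit=True (the default) the best
# parameter combination is refitted on all of X_train
search.fit(X_train, y_train)

train_acc = search.score(X_train, y_train)  # accuracy on training data
test_acc = search.score(X_test, y_test)     # accuracy on held-out data
print(train_acc, test_acc)                  # a large gap suggests overfitting
```

This keeps the stratified K-fold work inside RandomizedSearchCV, while the single outer hold-out set gives you the train-versus-test comparison the question asks for.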
UPD: Try to do this:
clf = RandomForestClassifier()
random_grid = _create_hyperparameter_finetuning_grid()
clf_random = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                                scoring='accuracy', n_iter=100,
                                cv=StratifiedKFold(n_splits=split_number),
                                verbose=2, random_state=42, n_jobs=-1,
                                return_train_score=True)
clf_random.fit(X, y)
print(clf_random.cv_results_)
Is this what you want?
The cv_results_ dict shows you the train and test accuracy for all splits and all iterations (note that the parameter is scoring, not score, and that train scores are only included when return_train_score=True in recent scikit-learn versions).
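As a runnable illustration of reading train versus test scores out of cv_results_ (a minimal sketch on synthetic data, not the asker's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = make_classification(n_samples=200, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'max_depth': [2, 4, None]},
    n_iter=3, scoring='accuracy',
    cv=StratifiedKFold(n_splits=3),
    return_train_score=True,   # off by default in recent scikit-learn
    random_state=0)
search.fit(X, y)

# One (train, test) mean-score pair per sampled parameter combination
for tr, te in zip(search.cv_results_['mean_train_score'],
                  search.cv_results_['mean_test_score']):
    print(f'train={tr:.3f}  cv={te:.3f}')  # large gap -> likely overfitting
```

So the per-candidate train/test comparison the question asks for is already available from a single fit, without any manual outer loop.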
Upvotes: 4