Reputation: 180
I am making a binary classifier with unbalanced classes (ratio 1:10). I tried KNN, random forests, and an XGB classifier, and I get the best precision-recall trade-off and F1 score from the XGB classifier (perhaps because the dataset is quite small, with shape (1900, 19)); a rough sketch of the comparison is below.
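For reference, the model comparison was roughly along these lines (a minimal sketch, assuming X_train and y_train are already defined; the exact comparison code is not part of this question):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# compare candidate classifiers by cross-validated F1 on the imbalanced data
candidates = {'knn': KNeighborsClassifier(),
              'rf': RandomForestClassifier(class_weight='balanced'),
              'xgb': XGBClassifier(objective='binary:logistic', scale_pos_weight=9)}
for name, model in candidates.items():
    f1 = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(name, f1.mean())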
So after checking error plots for XGB, I decided to go for RandomizedSearchCV() from sklearn for parameter tuning of my XGB classifier. Based on another answer on Stack Exchange, this is my code:
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

score_arr = []
clf_xgb = XGBClassifier(objective='binary:logistic')

param_dist = {'n_estimators': [50, 120, 180, 240, 400],
              'learning_rate': [0.01, 0.03, 0.05],
              'subsample': [0.5, 0.7],
              'max_depth': [3, 4, 5],
              'min_child_weight': [1, 2, 3],
              'scale_pos_weight': [9]}   # roughly the inverse class ratio (1:10)

clf = RandomizedSearchCV(clf_xgb, param_distributions=param_dist, n_iter=25,
                         scoring='precision', error_score=0, verbose=3, n_jobs=-1)
print(clf)

numFolds = 6
folds = StratifiedKFold(n_splits=numFolds, shuffle=True)

estimators = []
results = np.zeros(len(X_train))
score = 0.0
for train_index, test_index in folds.split(X_train, y_train):
    print(train_index)
    print(test_index)
    _X_train, _X_test = X_train.iloc[train_index, :], X_train.iloc[test_index, :]
    _y_train, _y_test = y_train.iloc[train_index].values.ravel(), y_train.iloc[test_index].values.ravel()
    clf.fit(_X_train, _y_train, eval_metric="error", verbose=True)   # runs the randomized search on this fold
    estimators.append(clf.best_estimator_)
    results[test_index] = clf.predict(_X_test)
    score_arr.append(f1_score(_y_test, results[test_index]))
    score += f1_score(_y_test, results[test_index])
score /= numFolds
So RandomizedSearchCV actually selects the classifier, and in each k-fold iteration it is fit on the training fold and predicts on the validation fold. Note that I have given X_train and y_train to the k-fold split, so that I have a separate test dataset for testing the final algorithm (created as sketched below).
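The held-out test set itself comes from a stratified split, roughly like this (a sketch; test_size and random_state are assumptions, since the exact split is not shown above):

from sklearn.model_selection import train_test_split

# keep a stratified hold-out set; the k-fold loop above only ever sees X_train / y_train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)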
Now, the problem is: if you actually look at the f1-score in each k-fold iteration, it looks like this: score_arr = [0.5416666666666667, 0.4, 0.41379310344827586, 0.5, 0.44, 0.43478260869565216]. But when I test clf.best_estimator_ as my model on my test dataset, it gives an f1-score of 0.80, with precision and recall of {'precision': 0.8688524590163934, 'recall': 0.7571428571428571}.
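For reference, the test-set numbers above come from an evaluation roughly like this (a sketch; X_test and y_test are the held-out split, and my exact evaluation code is not part of this question):

from sklearn.metrics import f1_score, precision_score, recall_score

best_model = clf.best_estimator_       # best estimator from the last RandomizedSearchCV fit
y_pred = best_model.predict(X_test)    # predictions on the held-out test set
print('f1:', f1_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall:', recall_score(y_test, y_pred))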
How come my score during validation is so low, and what has happened now on the test set? Is my model correct, or did I miss something?
P.S. - Taking the parameters of clf.best_estimator_, I fitted them separately on my training data using xgb.cv, and the f1-score was still only around 0.55. I think this might be due to differences between the training approaches of RandomizedSearchCV and xgb.cv (the xgb.cv call is sketched below). Please tell me if plots or more info are needed.
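The xgb.cv run mentioned in the P.S. looked roughly like this (a sketch that pulls the tuned values out of clf.best_estimator_ and reports aucpr; xgb.cv would need a custom feval to report F1 directly, which is omitted here):

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
best_params = clf.best_estimator_.get_params()
cv_params = {'objective': 'binary:logistic',
             'learning_rate': best_params['learning_rate'],
             'max_depth': best_params['max_depth'],
             'min_child_weight': best_params['min_child_weight'],
             'subsample': best_params['subsample'],
             'scale_pos_weight': best_params['scale_pos_weight']}
cv_results = xgb.cv(cv_params, dtrain,
                    num_boost_round=best_params['n_estimators'],
                    nfold=6, stratified=True, metrics='aucpr', seed=42)
print(cv_results.tail())   # per-round train/test aucpr mean and std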
Update: I am attaching error plots of train and test aucpr and classification accuracy for the generated model. The plots were generated by running model.fit() only once (which is consistent with the values in score_arr).
Upvotes: 1
Views: 3469
Reputation: 711
Randomized search on hyperparameters.
While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favorable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
A budget can be chosen independently of the number of parameters and possible values.
Adding parameters that do not influence the performance does not decrease efficiency.
If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. It is highly recommended to use continuous distributions for continuous parameters.
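For example, continuous and integer parameters can be sampled from scipy.stats distributions instead of fixed lists; a sketch for the classifier in the question:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {'n_estimators': randint(50, 400),      # integers sampled uniformly from [50, 400)
              'learning_rate': uniform(0.01, 0.05),  # continuous, uniform on [0.01, 0.06]
              'subsample': uniform(0.5, 0.3),        # continuous, uniform on [0.5, 0.8]
              'max_depth': randint(3, 6),
              'min_child_weight': randint(1, 4),
              'scale_pos_weight': [9]}
clf = RandomizedSearchCV(XGBClassifier(objective='binary:logistic'),
                         param_distributions=param_dist,
                         n_iter=25, scoring='precision', n_jobs=-1)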
For more, see the reference: sklearn documentation for RandomizedSearchCV.
Upvotes: 1