Reputation: 180
I am making a binary classifier with unbalanced classes (ratio 1:10). I tried KNN, random forests, and an XGB classifier, and I get the best precision-recall trade-off and F1 score from the XGB classifier (perhaps because the dataset is quite small, with shape (1900, 19)); a rough sketch of the comparison is below.
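For reference, the model comparison was roughly along these lines (a minimal sketch, assuming X_train and y_train are already defined; the exact comparison code is not part of this question):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# compare candidate classifiers by cross-validated F1 on the imbalanced data
candidates = {'knn': KNeighborsClassifier(),
              'rf': RandomForestClassifier(class_weight='balanced'),
              'xgb': XGBClassifier(objective='binary:logistic', scale_pos_weight=9)}
for name, model in candidates.items():
    f1 = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(name, f1.mean())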
So after checking error plots for XGB, I decided to go for RandomizedSearchCV() from sklearn for parameter tuning of my XGB classifier. Based on another answer on Stack Exchange, this is my code:
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

score_arr = []
clf_xgb = XGBClassifier(objective='binary:logistic')

param_dist = {'n_estimators': [50, 120, 180, 240, 400],
              'learning_rate': [0.01, 0.03, 0.05],
              'subsample': [0.5, 0.7],
              'max_depth': [3, 4, 5],
              'min_child_weight': [1, 2, 3],
              'scale_pos_weight': [9]}   # roughly the inverse class ratio (1:10)

clf = RandomizedSearchCV(clf_xgb, param_distributions=param_dist, n_iter=25,
                         scoring='precision', error_score=0, verbose=3, n_jobs=-1)
print(clf)

numFolds = 6
folds = StratifiedKFold(n_splits=numFolds, shuffle=True)

estimators = []
results = np.zeros(len(X_train))
score = 0.0
for train_index, test_index in folds.split(X_train, y_train):
    print(train_index)
    print(test_index)
    _X_train, _X_test = X_train.iloc[train_index, :], X_train.iloc[test_index, :]
    _y_train, _y_test = y_train.iloc[train_index].values.ravel(), y_train.iloc[test_index].values.ravel()
    clf.fit(_X_train, _y_train, eval_metric="error", verbose=True)   # runs the randomized search on this fold
    estimators.append(clf.best_estimator_)
    results[test_index] = clf.predict(_X_test)
    score_arr.append(f1_score(_y_test, results[test_index]))
    score += f1_score(_y_test, results[test_index])
score /= numFolds
So RandomizedSearchCV actually selects the classifier, and in each k-fold iteration it is fit on the training fold and predicts on the validation fold. Note that I have given X_train and y_train to the k-fold split, so that I have a separate test dataset for testing the final algorithm (created as sketched below).
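The held-out test set itself comes from a stratified split, roughly like this (a sketch; test_size and random_state are assumptions, since the exact split is not shown above):

from sklearn.model_selection import train_test_split

# keep a stratified hold-out set; the k-fold loop above only ever sees X_train / y_train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)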
Now, the problem is: if you actually look at the f1-score in each k-fold iteration, it looks like this: score_arr = [0.5416666666666667, 0.4, 0.41379310344827586, 0.5, 0.44, 0.43478260869565216]. But when I test clf.best_estimator_ as my model on my test dataset, it gives an f1-score of 0.80, with precision and recall of {'precision': 0.8688524590163934, 'recall': 0.7571428571428571}.
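For reference, the test-set numbers above come from an evaluation roughly like this (a sketch; X_test and y_test are the held-out split, and my exact evaluation code is not part of this question):

from sklearn.metrics import f1_score, precision_score, recall_score

best_model = clf.best_estimator_       # best estimator from the last RandomizedSearchCV fit
y_pred = best_model.predict(X_test)    # predictions on the held-out test set
print('f1:', f1_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall:', recall_score(y_test, y_pred))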
How come my score during validation is so low, and what has happened now on the test set? Is my model correct, or did I miss something?
P.S. - Taking the parameters of clf.best_estimator_, I fitted them separately on my training data using xgb.cv, and the f1-score was still only around 0.55. I think this might be due to differences between the training approaches of RandomizedSearchCV and xgb.cv (the xgb.cv call is sketched below). Please tell me if plots or more info are needed.
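The xgb.cv run mentioned in the P.S. looked roughly like this (a sketch that pulls the tuned values out of clf.best_estimator_ and reports aucpr; xgb.cv would need a custom feval to report F1 directly, which is omitted here):

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
best_params = clf.best_estimator_.get_params()
cv_params = {'objective': 'binary:logistic',
             'learning_rate': best_params['learning_rate'],
             'max_depth': best_params['max_depth'],
             'min_child_weight': best_params['min_child_weight'],
             'subsample': best_params['subsample'],
             'scale_pos_weight': best_params['scale_pos_weight']}
cv_results = xgb.cv(cv_params, dtrain,
                    num_boost_round=best_params['n_estimators'],
                    nfold=6, stratified=True, metrics='aucpr', seed=42)
print(cv_results.tail())   # per-round train/test aucpr mean and std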
Update: I am attaching error plots of train and test aucpr and classification accuracy for the generated model. The plots were generated by running model.fit() only once (which is consistent with the values in score_arr).
Upvotes: 1
Views: 3469
Reputation: 711
Randomized search on hyperparameters.
While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favorable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:
A budget can be chosen independently of the number of parameters and possible values.
Adding parameters that do not influence the performance does not decrease efficiency.
If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. It is highly recommended to use continuous distributions for continuous parameters.
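For example, continuous and integer parameters can be sampled from scipy.stats distributions instead of fixed lists; a sketch for the classifier in the question:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {'n_estimators': randint(50, 400),      # integers sampled uniformly from [50, 400)
              'learning_rate': uniform(0.01, 0.05),  # continuous, uniform on [0.01, 0.06]
              'subsample': uniform(0.5, 0.3),        # continuous, uniform on [0.5, 0.8]
              'max_depth': randint(3, 6),
              'min_child_weight': randint(1, 4),
              'scale_pos_weight': [9]}
clf = RandomizedSearchCV(XGBClassifier(objective='binary:logistic'),
                         param_distributions=param_dist,
                         n_iter=25, scoring='precision', n_jobs=-1)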
For more, see the reference: sklearn documentation for RandomizedSearchCV.
Upvotes: 1