Reputation: 1910
I got a little confused when using models from sklearn: how do I set a specific optimization function? For example, when RandomForestClassifier is used, how do I let the model 'know' that I want to maximize 'recall', 'F1 score' or 'AUC' instead of 'accuracy'?
Any suggestions? Thank you.
Upvotes: 1
Views: 2621
Reputation: 8801
What you are looking for is Parameter Tuning. Basically, you first select an estimator, then define a hyper-parameter space (i.e. all the parameters and their respective values that you want to tune), a cross-validation scheme and a scoring function. Depending on how you want to search the parameter space, you can choose one of the following:
Exhaustive Grid Search: In this approach, sklearn evaluates a grid of all possible combinations of the hyper-parameter values defined by the user, using GridSearchCV. For instance:
from sklearn.tree import DecisionTreeClassifier

# Estimator whose hyper-parameters will be tuned
my_clf = DecisionTreeClassifier(random_state=0, class_weight='balanced')

# Hyper-parameter grid: each entry maps a parameter name to its candidate values
param_grid = dict(
    min_samples_split=[5, 7, 9, 11],
    max_leaf_nodes=[50, 60, 70, 80],
    max_depth=[1, 3, 5, 7, 9]
)
In this case, the specified grid is the cross-product of the candidate values of min_samples_split, max_leaf_nodes and max_depth. The documentation states that:
The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.
An example of using GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score

# Create a classifier
clf = LogisticRegression(random_state=0)

# Cross-validation scheme (features, labels and n_splits are assumed to be defined)
cv = StratifiedKFold(n_splits=n_splits).split(features, labels)

# Declare the hyper-parameter grid
param_grid = dict(
    tol=[1.0, 0.1, 0.01, 0.001],
    C=np.power(10.0, np.arange(-3, 2)).tolist(),
    solver=['newton-cg', 'lbfgs', 'liblinear', 'sag'],
)

# Perform grid search using the classifier, parameter grid, scoring function and CV scheme
grid_search = GridSearchCV(clf, param_grid=param_grid, verbose=10,
                           scoring=make_scorer(f1_score), cv=list(cv))
grid_search.fit(features.values, labels.values)

# Best score obtained with the specified scoring function
print(grid_search.best_score_)

# Best estimator found during the search
best_clf = grid_search.best_estimator_
print(best_clf)
You can read more in its documentation here to learn about the various attributes and methods for retrieving the best parameters, best score, etc.
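For instance, once the search has been fitted as above, the best parameter combination and the per-combination results can be inspected (a minimal sketch using standard GridSearchCV attributes):

# Best hyper-parameter combination found during the search
print(grid_search.best_params_)

# Per-combination results, e.g. the mean cross-validated test score for each setting
print(grid_search.cv_results_['mean_test_score'])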
Randomized Search: Instead of exhaustively searching the hyper-parameter space, sklearn implements RandomizedSearchCV to perform a randomized search over the parameters. The documentation states that:
RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.
You can read more about it from here.
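A minimal sketch of how it could be used for a logistic regression (the distributions and values below are illustrative assumptions, not part of the original example):

from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

clf = LogisticRegression(random_state=0, solver='liblinear')

# Parameters can be given as lists or as scipy distributions to sample from
param_distributions = dict(
    C=uniform(loc=0.001, scale=10),   # sample C uniformly from [0.001, 10.001)
    penalty=['l1', 'l2'],
)

# n_iter controls how many random parameter settings are tried
random_search = RandomizedSearchCV(clf, param_distributions=param_distributions,
                                   n_iter=20, scoring='f1', cv=5, random_state=0)
# random_search.fit(features, labels)  # features, labels assumed as above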
You can read more about other approaches here.
Edit: In your case, if you want to maximize recall for the model, you simply specify 'recall' (or recall_score from sklearn.metrics wrapped in make_scorer) as the scoring function.
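For example, a minimal sketch applied to the RandomForestClassifier from the question (the grid values here are illustrative assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, recall_score

rf = RandomForestClassifier(random_state=0)
param_grid = dict(n_estimators=[100, 300], max_depth=[None, 5, 10])

# scoring=make_scorer(recall_score) (or simply scoring='recall') makes the search
# pick the parameter combination with the highest recall instead of accuracy
grid = GridSearchCV(rf, param_grid=param_grid, scoring=make_scorer(recall_score), cv=5)
# grid.fit(features, labels)  # features, labels assumed as above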
If you wish to maximize 'False Positives' as stated in your question, you can refer to this answer to extract the false positives from the confusion matrix. Then use the make_scorer function and pass the resulting scorer to the GridSearchCV object for tuning.
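A minimal sketch of such a custom scorer (a hypothetical example, assuming a binary classification problem where a higher false-positive count should score higher):

from sklearn.metrics import confusion_matrix, make_scorer

def false_positives(y_true, y_pred):
    # For binary labels, confusion_matrix returns [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp

# greater_is_better=True tells GridSearchCV to favour higher values of this metric
fp_scorer = make_scorer(false_positives, greater_is_better=True)
# grid = GridSearchCV(clf, param_grid=param_grid, scoring=fp_scorer, cv=5)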
Upvotes: 4
Reputation: 1421
I would suggest you grab a cup of coffee and read (and understand) the following
http://scikit-learn.org/stable/modules/model_evaluation.html
You need to use something along the lines of
cross_val_score(model, X, y, scoring='f1')
possible choices are (check the docs)
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score',
'average_precision', 'completeness_score', 'explained_variance',
'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted',
'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score',
'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error',
'neg_mean_squared_log_error', 'neg_median_absolute_error',
'normalized_mutual_info_score', 'precision', 'precision_macro',
'precision_micro', 'precision_samples', 'precision_weighted', 'r2',
'recall', 'recall_macro', 'recall_micro', 'recall_samples',
'recall_weighted', 'roc_auc', 'v_measure_score']
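For example, a minimal sketch for the RandomForestClassifier from the question (assuming X and y are your feature matrix and labels):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier(random_state=0)

# Each call evaluates the model with a different metric via 5-fold cross-validation
recall_scores = cross_val_score(model, X, y, scoring='recall', cv=5)
auc_scores = cross_val_score(model, X, y, scoring='roc_auc', cv=5)
print(recall_scores.mean(), auc_scores.mean())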
Have fun, Umberto
Upvotes: -3