user6396

Reputation: 1910

Binary classification: targeting a specific metric (e.g. false positives)

I got a little confused when using models from sklearn: how do I set a specific optimization function? For example, when RandomForestClassifier is used, how do I let the model 'know' that I want to maximize 'recall', 'F1 score', or 'AUC' instead of 'accuracy'?

Any suggestions? Thank you.

Upvotes: 1

Views: 2621

Answers (2)

Gambit1614

Reputation: 8801

What you are looking for is parameter tuning. Basically, you first select an estimator, then define a hyper-parameter space (i.e. all the parameters and their respective values that you want to tune), a cross-validation scheme and a scoring function. Depending on how you want to search the parameter space, you can choose one of the following:

Exhaustive Grid Search In this approach, sklearn creates a grid of all possible combinations of the hyper-parameter values defined by the user, via the GridSearchCV class. For instance:

from sklearn.tree import DecisionTreeClassifier

my_clf = DecisionTreeClassifier(random_state=0, class_weight='balanced')
param_grid = dict(
            min_samples_split=[5, 7, 9, 11],
            max_leaf_nodes=[50, 60, 70, 80],
            max_depth=[1, 3, 5, 7, 9]
            )

In this case, the grid specified is the cross-product of the values of min_samples_split, max_leaf_nodes and max_depth (4 × 4 × 5 = 80 combinations in total). Note that a classifier__ prefix on the parameter names is only needed when the estimator is a step named classifier inside a Pipeline; for a bare estimator like my_clf, the plain names are used. The documentation states that:

The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.

An example of using GridSearchCV:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score

# Create a classifier
clf = LogisticRegression(random_state=0)

# Cross-validation scheme (features and labels are assumed to be a
# pandas DataFrame and Series respectively)
cv = StratifiedKFold(n_splits=5).split(features, labels)

# Declare the hyper-parameter grid
param_grid = dict(
    tol=[1.0, 0.1, 0.01, 0.001],
    C=np.power(10.0, np.arange(-3, 2)).tolist(),
    solver=['newton-cg', 'lbfgs', 'liblinear', 'sag'],
)

# Perform grid search using the classifier, the parameter grid,
# the scoring function and the cross-validation scheme
grid_search = GridSearchCV(clf, param_grid=param_grid, verbose=10,
                           scoring=make_scorer(f1_score), cv=list(cv))

grid_search.fit(features.values, labels.values)

# Best score achieved with the specified scoring function
print(grid_search.best_score_)

# Best estimator found by the search
best_clf = grid_search.best_estimator_
print(best_clf)

You can read more in its documentation here to learn about the various methods and attributes for retrieving the best parameters, scores, etc.
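
For instance (a short illustrative snippet using the grid_search object fitted above), the fitted GridSearchCV instance also exposes attributes such as best_params_ and cv_results_:

# Best hyper-parameter combination found by the search
print(grid_search.best_params_)

# Per-combination results (mean test scores, fit times, ...) as a dict of arrays
print(grid_search.cv_results_['mean_test_score'])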

Randomized Search Instead of exhaustively searching the hyper-parameter space, sklearn implements RandomizedSearchCV to do a randomized search over the parameters. The documentation states that:

RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.

You can read more about it from here.
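
A minimal sketch of how this might look, reusing the clf, features and labels from above (the param_distributions values and n_iter=20 here are illustrative assumptions, not prescriptions):

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, recall_score

# Distributions (or lists) from which parameter values are sampled
param_distributions = dict(
    C=uniform(loc=0.001, scale=10),
    solver=['newton-cg', 'lbfgs', 'liblinear', 'sag'],
)

# Sample 20 random parameter settings and score each one with recall
random_search = RandomizedSearchCV(clf, param_distributions=param_distributions,
                                   n_iter=20, scoring=make_scorer(recall_score),
                                   cv=5, random_state=0)
random_search.fit(features.values, labels.values)
print(random_search.best_params_)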

You can read more about other approaches here.


Edit: In your case, if you want to maximize recall for the model, you simply specify recall_score from sklearn.metrics (wrapped in make_scorer) as the scoring function.
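
For example, reusing the clf and param_grid from above (the string shortcut scoring='recall' works as well):

from sklearn.metrics import make_scorer, recall_score

# Tune the hyper-parameters so that recall is maximized
grid_search = GridSearchCV(clf, param_grid=param_grid,
                           scoring=make_scorer(recall_score), cv=5)
grid_search.fit(features.values, labels.values)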

If you wish to optimize for 'False Positives', as stated in your question, you can refer to this answer to see how to extract the false positives from the confusion matrix. Then wrap that logic with the make_scorer function and pass it to the GridSearchCV object for tuning.
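
A hedged sketch of such a custom scorer (assuming a binary 0/1 target and that you want as few false positives as possible; flip greater_is_better if you intend the opposite):

from sklearn.metrics import confusion_matrix, make_scorer

def false_positives(y_true, y_pred):
    # For binary labels, confusion_matrix returns [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fp

# greater_is_better=False tells the search that a lower count is better
fp_scorer = make_scorer(false_positives, greater_is_better=False)

grid_search = GridSearchCV(clf, param_grid=param_grid, scoring=fp_scorer, cv=5)
grid_search.fit(features.values, labels.values)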

Upvotes: 4

Umberto

Reputation: 1421

I would suggest you grab a cup of coffee and read (and understand) the following

http://scikit-learn.org/stable/modules/model_evaluation.html

You need to use something along the lines of

cross_val_score(model, X, y, scoring='f1')

possible choices are (check the docs)

['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 
'average_precision', 'completeness_score', 'explained_variance', 
'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 
'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 
'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 
'neg_mean_squared_log_error', 'neg_median_absolute_error', 
'normalized_mutual_info_score', 'precision', 'precision_macro', 
'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 
'recall', 'recall_macro', 'recall_micro', 'recall_samples', 
'recall_weighted', 'roc_auc', 'v_measure_score']
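
A short, self-contained sketch tying this back to the question, scoring a RandomForestClassifier by recall instead of the default accuracy (the make_classification toy data is purely for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy binary-classification data, just for illustration
X, y = make_classification(n_samples=500, n_classes=2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Evaluate by recall (or 'f1', 'roc_auc', ...) instead of accuracy
scores = cross_val_score(model, X, y, scoring='recall', cv=5)
print(scores.mean())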

Have fun, Umberto

Upvotes: -3

Related Questions