London guy
London guy

Reputation: 28022

Controlling the threshold in Logistic Regression in Scikit Learn

I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even turned the class_weight feature to auto.

I know that in Logistic Regression it should be possible to know what is the threshold value for a particular pair of classes.

Is it possible to know what the threshold value is in each of the One-vs-All classes the LogisticRegression() method designs?

I did not find anything in the documentation page.

Does it by default apply the 0.5 value as threshold for all the classes regardless of the parameter values?

Upvotes: 46

Views: 99326

Answers (6)

qwr
qwr

Reputation: 10891

For probabilistic classifiers such as logistic regression, the optimal Bayes estimator is

ŷ = argmaxy P(Y = y | X)

that is, the predicted class with the highest probability. For binary classification, this is equivalent to using probability 0.5 as a threshold.

As Nikita Astrakhantsev says, the threshold can be adjusted to control false positives and false negatives, therefore sensitivity/specificity, depending on the business (determined outside of statistics) requirements of the model. For highly unbalanced datasets, logistic regression may benefit from oversampling.

Upvotes: 0

veg2020
veg2020

Reputation: 1020

Yes, Sci-Kit learn is using a threshold of P>=0.5 for binary classifications. I am going to build on some of the answers already posted with two options to check this:

One simple option is to extract the probabilities of each classification using the output from model.predict_proba(test_x) segment of the code below along with class predictions (output from model.predict(test_x) segment of code below). Then, append class predictions and their probabilities to your test dataframe as a check.

As another option, one can graphically view precision vs. recall at various thresholds using the following code.

### Predict test_y values and probabilities based on fitted logistic 
regression model

pred_y=log.predict(test_x) 

probs_y=log.predict_proba(test_x) 
  # probs_y is a 2-D array of probability of being labeled as 0 
  # (first column of array) vs 1 (2nd column in array)


from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:, 
1]) 
   #retrieve probability of being 1(in second column of probs_y)
pr_auc = metrics.auc(recall, precision)

plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[: -1], "b--", label="Precision")
plt.plot(thresholds, recall[: -1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0,1])

Upvotes: 23

jazib jamil
jazib jamil

Reputation: 552

There is a little trick that I use, instead of using model.predict(test_data) use model.predict_proba(test_data). Then use a range of values for thresholds to analyze the effects on the prediction;

pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
    print ('\n******** For i = {} ******'.format(i))
    Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
    test_accuracy = metrics.accuracy_score(Y_test.values.reshape(Y_test.values.size,1),
                                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1))
    print('Our testing accuracy is {}'.format(test_accuracy))

    print(confusion_matrix(Y_test.values.reshape(Y_test.values.size,1),
                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1)))

Best!

Upvotes: 37

hqp
hqp

Reputation: 19

If using @jazib jamil's and @Halee's solution in Pandas version 0.23.0+, replace .as_matrix() with .values (documentation).

pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
    print ('\n******** For i = {} ******'.format(i))
    Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
    test_accuracy = metrics.accuracy_score(Y_test.values.reshape(Y_test.values.size,1),
                                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1))
    print('Our testing accuracy is {}'.format(test_accuracy))

    print(confusion_matrix(Y_test.values.reshape(Y_test.values.size,1),
                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1)))

Upvotes: 0

Nabat Farsi
Nabat Farsi

Reputation: 982

we can use a wrapper as follows:

model = LogisticRegression()
model.fit(X, y)

def custom_predict(X, threshold):
    probs = model.predict_proba(X) 
    return (probs[:, 1] > threshold).astype(int)
    
    
new_preds = custom_predict(X=X, threshold=0.4) 

Upvotes: 8

Nikita Astrakhantsev
Nikita Astrakhantsev

Reputation: 4749

Logistic regression chooses the class that has the biggest probability. In case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same stands for the multiclass setting: again, it chooses the class with the biggest probability (see e.g. Ng's lectures, the bottom lines).

Introducing special thresholds only affects in the proportion of false positives/false negatives (and thus in precision/recall tradeoff), but it is not the parameter of the LR model. See also the similar question.

Upvotes: 25

Related Questions