Blazej Kowalski

Reputation: 367

Logistic regression in Python: changing the probability threshold

I am approaching a classification problem with logistic regression, and every prediction on the test set comes out as class "1". The set is very imbalanced: it has over 200k observations and roughly 92% belong to class "1". Logistic regression generally classifies an input as class "1" if P(Y=1|X) > 0.5. Since all of the test observations are being classified as class 1, I thought that maybe there is a way to change this threshold and set it to, for example, 0.75, so that only observations with P(Y=1|X) > 0.75 are classified as class 1 and otherwise class 0. How can I implement this in Python?

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve

model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
score = accuracy_score(y_test, model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cr = classification_report(y_test, model.predict(X_test))

PS. Since all the observations in the test set are being classified as class 1, the F1 score and recall for class 0 in the classification report are 0. Maybe changing the threshold will solve this problem.
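Edit: a sketch of what I mean, applying a 0.75 threshold manually to the `predict_proba` output. The `make_classification` setup here is synthetic imbalanced data standing in for my real set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~92% class 1), standing in for the real set
X, y = make_classification(n_samples=2000, weights=[0.08, 0.92], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)

# predict_proba returns [P(Y=0|X), P(Y=1|X)]; take the second column
proba = model.predict_proba(X_test)[:, 1]

# Classify as 1 only when P(Y=1|X) > 0.75 instead of the default 0.5
threshold = 0.75
y_pred = (proba > threshold).astype(int)
```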

Upvotes: 0

Views: 4258

Answers (1)

Simon

Reputation: 5698

A thing you might want to try is balancing the classes instead of changing the threshold. Scikit-learn supports this via the class_weight parameter. For example, you could try model = LogisticRegression(penalty='l2', class_weight='balanced', C=1). Look at the documentation for more details:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
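A minimal sketch of what that might look like end to end (the synthetic data from `make_classification` is an assumption, standing in for your real set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~92% class 1), standing in for the real set
X, y = make_classification(n_samples=2000, weights=[0.08, 0.92], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency,
# so the minority class contributes as much to the loss as the majority
model = LogisticRegression(penalty='l2', class_weight='balanced', C=1)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```

With the reweighting, the model is no longer free to predict class 1 everywhere, so the class-0 row of the report should show nonzero recall.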

Upvotes: 2
