dropWizard

Reputation: 3548

scikit-learn predict_proba - move threshold from 0.5 to something else

I'm new to pandas & scikit-learn. I've been able to put together a simple binary classification model with two labels, Bad and Good:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

df = pd.read_csv('pandas_model.csv', header=None, names=['label', 'resume'])
X = df.resume.astype('U').values
y = df.label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the vectorizer on the training data only, then transform both sets
vect = TfidfVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

y_pred_class = logreg.predict(X_test_dtm)
score = metrics.accuracy_score(y_test, y_pred_class)
print('LogReg Accuracy Score: ' + str(score))

log_reg_cf = metrics.confusion_matrix(y_test, y_pred_class)
print(log_reg_cf)

Confusion Matrix:

[[2696  165]
 [ 742  424]]
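
With string labels, scikit-learn orders the confusion matrix rows and columns by sorted class label, so row 0 / column 0 is Bad and row 1 / column 1 is Good; the 742 is the count of true-Good rows that were predicted Bad. You can pin that ordering down explicitly with the labels parameter (assuming the label strings are exactly 'Bad' and 'Good'):

log_reg_cf = metrics.confusion_matrix(y_test, y_pred_class, labels=['Bad', 'Good'])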

It looks like it is predicting too many data points as "Bad" when they should be "Good" (742 of them).

I read that scikit-learn uses 0.5 as the threshold when turning the predict_proba() score into a decision.

I'm trying to put together a way to "test" various thresholds (e.g. 0.4 instead of 0.5), which would move some of those false negatives over to being correctly predicted as Good.

logreg.predict_proba(X_test_dtm)

gives me a 2D array of the probabilities (Bad / Good):

# first row: predicted Bad at the default 0.5 threshold, but if the Bad
# threshold were 0.6 it would be predicted Good; this is what I'm trying
# to run simulations on
array([[0.59946085, 0.40053915],
       [0.89679281, 0.10320719],
       [0.328435  , 0.671565  ],
       ...,
       [0.50415322, 0.49584678],
       [0.84380259, 0.15619741],
       [0.85216752, 0.14783248]])
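
The column order matches logreg.classes_ (scikit-learn sorts the class labels), so a quick check confirms which column is which:

print(logreg.classes_)
# -> ['Bad' 'Good'], i.e. column 0 is P(Bad) and column 1 is P(Good)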

y_test.head() gives me the true values (by the way, what does 5369 represent? the row number?):

5369      Bad
11313     Bad
11899    Good
3856      Bad
1961      Bad
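
(Those numbers appear to be the pandas index inherited from df: train_test_split shuffles the rows but keeps each row's original index, so 5369 is that row's position in the original DataFrame. For example:)

print(df.resume.loc[5369])  # the resume text for that same original row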

Ideally I'm trying to run simulations that do something like this across all of the X_test_dtm probabilities:

# pseudocode: apply a custom threshold to the Bad probability
if bad_score > 0.6:  # instead of the default 0.5
    result = 'Bad'
else:
    result = 'Good'

and then check the results against y_test and recompute the accuracy score.
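
A minimal sketch of that simulation, assuming the model and split from above (the variable names are mine):

import numpy as np
from sklearn import metrics

probs = logreg.predict_proba(X_test_dtm)

# sweep thresholds on the Bad probability and score the resulting predictions
for threshold in np.arange(0.40, 0.75, 0.05):
    predicted = np.where(probs[:, 0] > threshold, 'Bad', 'Good')
    print(round(threshold, 2), metrics.accuracy_score(y_test, predicted))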

It doesn't seem like there is any way to move that 0.5 threshold in scikit-learn, so it looks like I have to do it manually.

Basically I'm trying to make it "harder" for a data point to be predicted as Bad.

Hopefully I've worded this question so it makes sense

I am getting an error when I try the code from the question marked as a duplicate:

from sklearn.metrics import precision_recall_curve
probs_y=logreg.predict_proba(X_test_dtm)
precision, recall, thresholds = precision_recall_curve(y_test, probs_y[:, 0])

ValueError: Data is not binary and pos_label is not specified
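
The error comes from the string labels: precision_recall_curve cannot guess the positive class when the targets are not {0, 1} or {-1, 1}, so pos_label has to be passed explicitly. A sketch of the call, assuming (as above) that column 0 holds the Bad probabilities:

from sklearn.metrics import precision_recall_curve

# tell scikit-learn which class the probabilities in column 0 belong to
precision, recall, thresholds = precision_recall_curve(
    y_test, probs_y[:, 0], pos_label='Bad')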

Upvotes: 0

Views: 3689

Answers (1)

Szymon Maszke

Reputation: 24904

IIUC you could do it really simply (at least for the binary case) with predict_proba:

probabilities = logreg.predict_proba(X_test_dtm)

threshold = 0.4
good = probabilities[:, 1]           # column 1 holds the Good probabilities
predicted_good = good > threshold    # boolean array: True where Good clears the bar

This gives you a binary prediction for the Good class wherever its probability is higher than 0.4 (the threshold set above).

You can easily generalize the code above to test any threshold you like with whatever metric you like that requires a binary prediction.
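
For example, to re-check accuracy and the confusion matrix at the new threshold (a sketch reusing the variables above; np.where maps the boolean mask back onto the string labels):

import numpy as np
from sklearn import metrics

predicted_labels = np.where(predicted_good, 'Good', 'Bad')
print(metrics.accuracy_score(y_test, predicted_labels))
print(metrics.confusion_matrix(y_test, predicted_labels))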

Upvotes: 4
