Md. Rezaul Karim

Reputation: 97

Why does predict_proba in sklearn produce two columns, and what do they mean?

I was using a simple logistic regression for a classification problem and trying to plot the precision_recall_curve and the roc_curve using predict_proba(X_test). I checked the docstring of predict_proba, but it did not give much detail on how it works. I kept getting a bad input error and found that the shapes of y_test and predict_proba(X_test) do not match. I finally discovered that predict_proba() produces 2 columns and that people use the second one.

It would be really helpful if someone could explain how it produces two columns and what they mean. Thanks in advance.

Upvotes: 9

Views: 11690

Answers (2)

Shibendu

Reputation: 41

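The columns of predict_proba correspond to the classifier's classes, in the order given by its classes_ attribute. If the classifier is named model, then model.classes_ gives the distinct classes, and column i of model.predict_proba(X) holds the probability of model.classes_[i].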
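To make the column order concrete, here is a minimal sketch on made-up toy data (the feature values and the "healthy"/"diabetes" labels are assumptions for illustration, not from the question). It shows that the columns follow model.classes_, which scikit-learn stores in sorted order, so looking up the column by label is safer than hard-coding an index.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration: one feature, two string labels.
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array(["healthy", "healthy", "healthy",
              "diabetes", "diabetes", "diabetes"])

model = LogisticRegression().fit(X, y)

# classes_ is sorted, so "diabetes" comes before "healthy" here.
print(model.classes_)

proba = model.predict_proba(X)

# Look up the column for a label instead of hard-coding the index.
diabetes_col = list(model.classes_).index("diabetes")
p_diabetes = proba[:, diabetes_col]
print(p_diabetes)
```

With string labels the "positive" class is not necessarily in column 1, which is why checking classes_ first is worthwhile.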

Upvotes: 4

Turtalicious

Reputation: 430

predict_proba() produces output of shape (N, k), where N is the number of data points and k is the number of classes. You have two classes, hence 2 columns. The columns follow the order of model.classes_. Say your labels are 0 for "healthy" and 1 for "diabetes": if a data point is predicted to have an 80% chance of being diabetic, and consequently a 20% chance of being healthy, then the output row for that point will be [0.2, 0.8]. In general, you can get the probabilities for the k-th class (counting from 1, in the order of model.classes_) with model.predict_proba(X)[:, k-1].
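The shape claim can be checked directly. The sketch below uses a synthetic dataset from make_classification (an assumption, standing in for the asker's data) and confirms that there is one column per class and that each row sums to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (assumed data, not from the question).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)
print(proba.shape)  # (N, k): 200 rows, 2 columns

# Each row is a probability distribution over the classes.
row_sums = proba.sum(axis=1)
print(np.allclose(row_sums, 1.0))

# predict() corresponds to the class whose column has the highest probability.
pred_from_proba = model.classes_[proba.argmax(axis=1)]
match_rate = np.mean(pred_from_proba == model.predict(X))
print(match_rate)
```

Because the two columns sum to 1, the second column alone carries all the information in the binary case.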

As for plotting, you can do the following for precision_recall_curve:

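from sklearn.metrics import precision_recall_curve
# Pass only the second column: the probability of the positive class
predicted = logisticReg.predict_proba(X_test)
precision, recall, threshold = precision_recall_curve(y_test, predicted[:, 1])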

For ROC:

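from sklearn.metrics import roc_curve
# roc_curve returns false positive rates, true positive rates, and thresholds
predicted = logisticReg.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, predicted[:, 1])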
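For completeness, here is a self-contained sketch that runs end to end. The synthetic dataset and the train/test split are assumptions standing in for the asker's data; the name logisticReg is kept from the snippets above. Both curve functions expect a 1-D array of scores, which is likely why passing the full two-column output gives a bad input error.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumed, for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

logisticReg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 1-D array of scores for the positive class (label 1, the second column).
pos_scores = logisticReg.predict_proba(X_test)[:, 1]

precision, recall, pr_thresholds = precision_recall_curve(y_test, pos_scores)
fpr, tpr, roc_thresholds = roc_curve(y_test, pos_scores)

auc = roc_auc_score(y_test, pos_scores)
print("ROC AUC:", auc)
```

The resulting arrays can be passed straight to a plotting library, for example recall against precision and fpr against tpr.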

Note that this changes for multi-class or multi-label classification, where there is no single positive-class column to pick. You can find an example of that in the sklearn docs here
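One common way to handle more than two classes is a one-vs-rest curve per class. The sketch below is an assumption about how that might look, using the iris dataset and label_binarize; it is not necessarily the approach taken in the linked docs example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)  # shape (N, 3): one column per class

# One indicator column per class, in the same order as model.classes_.
y_test_bin = label_binarize(y_test, classes=model.classes_)

# One ROC curve per class: that class against all the others.
curves = {}
for i, cls in enumerate(model.classes_):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], proba[:, i])
    curves[cls] = (fpr, tpr)
```

Each entry in curves pairs column i of predict_proba with the indicator for model.classes_[i], so the column order is never assumed.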

Upvotes: 9
