Reputation: 97
I was using simple logistic regression on a prediction problem and trying to plot the precision_recall_curve and the roc_curve using predict_proba(X_test). I checked the docstring of predict_proba, but it didn't have much detail on how it works. I kept getting a bad input error and found that the shapes of y_test and predict_proba(X_test) don't match. I finally discovered that predict_proba() produces 2 columns and that people use the second.
It would be really helpful if someone could explain how it produces two columns and what they signify. TIA.
Upvotes: 9
Views: 11690
Reputation: 41
You can tell which column belongs to which class from the classifier's classes_ attribute. If the classifier is named model, then model.classes_ gives the distinct classes, in the same order as the columns of predict_proba.
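As a minimal sketch of that lookup (the toy data and the variable names are made up for illustration, not taken from the question), you can find the column for a label by its position in model.classes_ instead of hard-coding an index:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, invented for this example: one feature, string labels.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array(["healthy", "healthy", "healthy",
              "diabetes", "diabetes", "diabetes"])

model = LogisticRegression().fit(X, y)

# classes_ holds the distinct labels in sorted order,
# so "diabetes" comes before "healthy" here.
classes = list(model.classes_)

# Look the column up by label rather than assuming it is column 1.
diabetes_col = classes.index("diabetes")
proba_diabetes = model.predict_proba(X)[:, diabetes_col]
```

Note that with string labels the "positive" class is not necessarily the second column, which is why checking classes_ is safer than assuming [:, 1].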
Upvotes: 4
Reputation: 430
predict_proba() produces output of shape (N, k), where N is the number of data points and k is the number of classes you're trying to classify. You have two classes, hence 2 columns. The columns follow the order of model.classes_, which is sorted. Say your labels are 0 ("healthy") and 1 ("diabetes"): if a data point is predicted to have an 80% chance of being diabetic and consequently a 20% chance of being healthy, then the output row for that point will be [0.2, 0.8]. In general, you can get the probabilities for the k-th class in model.classes_ with model.predict_proba(X)[:, k-1].
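To make the (N, k) shape concrete, here is a small sketch on the iris dataset bundled with scikit-learn (my choice of dataset, since it has three classes; max_iter=1000 is only there to avoid a convergence warning):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes
model = LogisticRegression(max_iter=1000).fit(X, y)

# One row per sample, one column per class in model.classes_.
proba = model.predict_proba(X)

# Each row is a probability distribution over the classes.
row_sums = proba.sum(axis=1)

# predict() returns the label whose column has the highest probability.
pred_from_proba = model.classes_[np.argmax(proba, axis=1)]
```

Here proba has shape (150, 3), every row sums to 1, and pred_from_proba agrees with model.predict(X).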
As for plotting, you can do the following for precision_recall_curve:
from sklearn.metrics import precision_recall_curve
predicted = logisticReg.predict_proba(X_test)  # shape (N, 2)
precision, recall, thresholds = precision_recall_curve(y_test, predicted[:, 1])  # positive-class column only
For ROC:
from sklearn.metrics import roc_curve
predicted = logisticReg.predict_proba(X_test)  # shape (N, 2)
fpr, tpr, thresholds = roc_curve(y_test, predicted[:, 1])  # positive-class column only
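Putting the two together, a self-contained sketch might look like the following. The synthetic data from make_classification and the split parameters are assumptions of mine, standing in for your X and y; the key point is that both curve functions get the 1-D positive-class column, not the full two-column array.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem (labels 0 and 1), standing in for real data.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

logisticReg = LogisticRegression().fit(X_train, y_train)

# Column 1 corresponds to classes_[1], i.e. label 1, the positive class.
scores = logisticReg.predict_proba(X_test)[:, 1]

precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
```

From here, plotting recall against precision and fpr against tpr (for example with matplotlib) gives the two curves.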
Note that this changes for multi-label classification. The sklearn docs include an example of that.
Upvotes: 9