Reputation: 6372
I'm using scikit-learn for classification. Is there a way to get a probability vector that says how confident the classifier is in its predictions? I want a vector for the entire test set, not just for a single element. Basically, I need it to compute the ROC curve and AUC.
Upvotes: 1
Views: 1027
Reputation: 363577
Many classifiers have a decision_function method, a predict_proba method, or both, which can be used to obtain soft scores instead of hard decisions. Example:
>>> import numpy as np
>>> X = np.random.randn(10, 4)
>>> y = np.random.randint(0, 2, 10)
>>> from sklearn.svm import LinearSVC
>>> svm = LinearSVC().fit(X, y)
>>> svm.decision_function(X)
array([-0.92744332, 0.78697484, -0.71569751, -0.19938963, -0.15521737,
0.45962204, 0.1326111 , 0.44614422, 0.95731802, 0.8980536 ])
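The values, in this case, are signed scores proportional to the distance from the hyperplane of a linear SVM: the sign gives the predicted side and the magnitude indicates confidence. predict_proba is slightly different in that it returns a matrix of probabilities with one column per class, but you can get a vector of positive-class probabilities by indexing: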
>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression().fit(X, y)
>>> lr.predict_proba(X)
array([[ 0.73987796, 0.26012204],
[ 0.26009545, 0.73990455],
[ 0.63918314, 0.36081686],
[ 0.62055698, 0.37944302],
[ 0.54361598, 0.45638402],
[ 0.38383357, 0.61616643],
[ 0.50740302, 0.49259698],
[ 0.39236783, 0.60763217],
[ 0.32553896, 0.67446104],
[ 0.20791651, 0.79208349]])
>>> lr.predict_proba(X)[:, 1]
array([ 0.26012204, 0.73990455, 0.36081686, 0.37944302, 0.45638402,
0.61616643, 0.49259698, 0.60763217, 0.67446104, 0.79208349])
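To connect this back to the question, here is a minimal sketch of feeding such a score vector for a whole held-out test set into the ROC/AUC functions. The synthetic dataset, the split sizes, and the random seeds are assumptions made for illustration only; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem (arbitrary sizes and seeds, for illustration only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

lr = LogisticRegression().fit(X_train, y_train)

# One score per test sample: the probability of the positive class.
scores = lr.predict_proba(X_test)[:, 1]

# ROC curve points and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

# decision_function scores rank the samples the same way here,
# so they should give the same AUC.
auc_from_margin = roc_auc_score(y_test, lr.decision_function(X_test))
print(auc, auc_from_margin)
```

Either kind of score works for ROC analysis, because only the ranking of the samples matters, not the scale of the scores.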
Upvotes: 1
Reputation: 2753
If your only goal is to get the ROC curve and AUC, check out sklearn.metrics.roc_auc_score (and sklearn.metrics.roc_curve for the curve itself).
From the docs:
>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
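Note that the call above is the binary case, where y_scores is a single vector of scores for the positive class. If you're dealing with multiclass classification, you can either compute a separate roc_auc_score value for each class (one-vs-rest) or, in recent scikit-learn versions, pass the full matrix of class probabilities together with the multi_class argument ("ovr" or "ovo").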
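For the multiclass case, a minimal sketch might look like the following. It assumes a scikit-learn version whose roc_auc_score accepts the multi_class argument; the choice of the iris dataset, LogisticRegression, and the split parameters are illustrative assumptions rather than anything from the docs example above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probas = clf.predict_proba(X_test)  # shape (n_samples, n_classes)

# Option 1: one-vs-rest AUC for each class separately.
per_class_auc = [
    roc_auc_score((y_test == k).astype(int), probas[:, i])
    for i, k in enumerate(clf.classes_)
]

# Option 2: let roc_auc_score do the one-vs-rest averaging (macro average).
ovr_auc = roc_auc_score(y_test, probas, multi_class="ovr")

print(per_class_auc, ovr_auc)
```

With the default macro averaging, the one-vs-rest value is the unweighted mean of the per-class values.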
Upvotes: 1
Reputation: 2532
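In scikit-learn, many classifiers provide a predict_proba method that returns the probability of each class for every sample. For SVC, you need to turn on the probability option (probability=True) when constructing the classifier before predict_proba becomes available. For example, using the famous iris dataset: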
from sklearn import svm
from sklearn import datasets
# train set
iris = datasets.load_iris()
X = iris.data[0::2, :2]
Y = iris.target[0::2]
clf = svm.SVC(probability=True)
clf.fit(X, Y)
# test set
Z = iris.data[1::2, :2]
Y_predict = clf.predict(Z)
Y_actual = iris.target[1::2]
Y_probas = clf.predict_proba(Z) # probabilities of each classification
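As a follow-up sketch, the probability matrix can be inspected and reduced to a single score vector for one class. This repeats the setup above so that it runs on its own; the random_state value and the choice of class 2 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import roc_auc_score

iris = datasets.load_iris()
X, Y = iris.data[0::2, :2], iris.target[0::2]          # train set
Z, Y_actual = iris.data[1::2, :2], iris.target[1::2]   # test set

clf = svm.SVC(probability=True, random_state=0).fit(X, Y)
Y_probas = clf.predict_proba(Z)

# One row per test sample, one column per class;
# the column order follows clf.classes_.
print(clf.classes_)
print(Y_probas.shape)

# Score vector for a single class (here class 2) against the rest.
k = 2
col = list(clf.classes_).index(k)
scores_k = Y_probas[:, col]
auc_k = roc_auc_score((Y_actual == k).astype(int), scores_k)
print(auc_k)
```

Looking up the column through clf.classes_ avoids assuming that class labels and column indices happen to coincide.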
Upvotes: 0