Jack Twain

Reputation: 6372

Returning a probability vector for predictions

I'm using scikit-learn for classification. Is there a way to get a probability vector that says how confident the classifier is in its prediction? I want a vector for the entire test set, not just for a single element. Basically I need that to compute the ROC curve and AUC.

Upvotes: 1

Views: 1027

Answers (3)

Fred Foo

Reputation: 363577

Many classifiers have either a decision_function method or a predict_proba method (or both), which can be used to obtain soft scores instead of hard decisions. Example:

>>> import numpy as np
>>> X = np.random.randn(10, 4)
>>> y = np.random.randint(0, 2, 10)
>>> from sklearn.svm import LinearSVC
>>> svm = LinearSVC().fit(X, y)
>>> svm.decision_function(X)
array([-0.92744332,  0.78697484, -0.71569751, -0.19938963, -0.15521737,
        0.45962204,  0.1326111 ,  0.44614422,  0.95731802,  0.8980536 ])

The values, in this case, are the signed distances from the hyperplane of a linear SVM. predict_proba is slightly different in that it returns a matrix of probabilities, but you can get a vector of positive probabilities by indexing:

>>> from sklearn.linear_model import LogisticRegression
>>> lr = LogisticRegression().fit(X, y)
>>> lr.predict_proba(X)
array([[ 0.73987796,  0.26012204],
       [ 0.26009545,  0.73990455],
       [ 0.63918314,  0.36081686],
       [ 0.62055698,  0.37944302],
       [ 0.54361598,  0.45638402],
       [ 0.38383357,  0.61616643],
       [ 0.50740302,  0.49259698],
       [ 0.39236783,  0.60763217],
       [ 0.32553896,  0.67446104],
       [ 0.20791651,  0.79208349]])
>>> lr.predict_proba(X)[:, 1]
array([ 0.26012204,  0.73990455,  0.36081686,  0.37944302,  0.45638402,
        0.61616643,  0.49259698,  0.60763217,  0.67446104,  0.79208349])

Upvotes: 1

Madison May

Reputation: 2753

If your only goal is to get the ROC curve and AUC, check out sklearn.metrics.roc_auc_score.

From the docs:

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

Note that roc_auc_score is only suited for binary classification tasks -- if you're dealing with multiclass classification you'll likely have to compute separate roc_auc_score values for each class.
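One common way to do that per-class computation is to binarize the labels and score each probability column separately, one-vs-rest. A sketch with hand-made toy scores (not from any real model):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

y_true = np.array([0, 1, 2, 2, 1, 0])
y_score = np.array([[0.7, 0.2, 0.1],   # one column of scores per class
                    [0.2, 0.6, 0.2],
                    [0.1, 0.3, 0.6],
                    [0.2, 0.2, 0.6],
                    [0.3, 0.5, 0.2],
                    [0.6, 0.3, 0.1]])

# Turn the multiclass labels into one binary indicator column per class,
# then compute an ordinary binary AUC for each class (one-vs-rest)
y_bin = label_binarize(y_true, classes=[0, 1, 2])
aucs = [roc_auc_score(y_bin[:, k], y_score[:, k]) for k in range(3)]
print(aucs)
```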

Upvotes: 1

ysakamoto

Reputation: 2532

In scikit-learn, classifiers that support probability estimates expose a predict_proba method that returns the probability of each class for every element; for SVC you have to enable this with the probability=True option. For example, using the famous iris dataset,

from sklearn import svm
from sklearn import datasets

# train set
iris = datasets.load_iris()
X = iris.data[0::2, :2]  
Y = iris.target[0::2]

clf = svm.SVC(probability=True)
clf.fit(X, Y) 

# test set
Z = iris.data[1::2, :2]

Y_predict = clf.predict(Z)
Y_actual = iris.target[1::2]
Y_probas = clf.predict_proba(Z) # probabilities of each classification
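If the aim is AUC, each per-class probability column from that example can be scored one-vs-rest. A hedged continuation of the setup above (the exact AUC values depend on the SVC fit, so none are asserted here):

```python
import numpy as np
from sklearn import svm, datasets
from sklearn.metrics import roc_auc_score

iris = datasets.load_iris()
X, Y = iris.data[0::2, :2], iris.target[0::2]         # train set: even rows
Z, Y_actual = iris.data[1::2, :2], iris.target[1::2]  # test set: odd rows

clf = svm.SVC(probability=True, random_state=0).fit(X, Y)
Y_probas = clf.predict_proba(Z)  # shape (75, 3): one probability column per class

# One-vs-rest AUC for each of the three iris classes
for k in range(3):
    print(k, roc_auc_score((Y_actual == k).astype(int), Y_probas[:, k]))
```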

Upvotes: 0
