Reputation: 13
I am working on a classification problem in which I want to find the probability of an input being classified as [1,0] and the probability of it being "not [1,0]".
I tried using the predict_proba method of SVC, but it gives the probability of each class, which is not what I'm looking for:
from sklearn.svm import SVC
model = SVC(probability=True)
model.fit(final_data,foreclosure_y)
results = model.predict_proba(final_data_test)[0]
I expect my output to look like this:
index,y
---------
0,0.45
1,0.62
2,0.43
3,0.12
4,0.55
Note: the output above is in .csv form, where y corresponds to test_y.
Here the column y holds, for each instance indexed from 0 to 4, the probability with which it could be classified as 0 or 1.
For example, index 0 has probability 0.45 of being classified as 0 or 1.
Upvotes: 0
Views: 1066
Reputation: 60321
Notice that
sum([0.58502114, 0.41497886])
# 1.0
predict_proba gives the probabilities for both your classes (hence the array elements sum up to 1), in the order that they appear in model.classes_; quoting from the docs (which are always your best friend in such situations):
Returns the probability of the sample for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.
Here is an example with toy data to illustrate the idea:
from sklearn.svm import SVC
model = SVC(probability=True)
X = [[1,2,3], [2,3,4]] # feature vectors
Y = [0, 1] # classes
model.fit(X, Y)
Let's now get the predicted probabilities for the first instance in the training set, [1,2,3]:
model.predict_proba(X)[0]
# array([0.39097541, 0.60902459])
OK, what is the order - i.e., which probability belongs to which class?
model.classes_
# array([0, 1])
So, this means that the probability for the instance belonging to class 0 is the first element of the array, 0.39097541, while the probability for belonging to class 1 is the second element, 0.60902459; and again, they sum up to 1, as expected:
sum([0.39097541, 0.60902459])
# 1.0
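As a small sketch of the same idea, the column for a given class can be looked up through classes_ rather than assumed by position, and the sum-to-1 property can be checked for every row at once. The data and the names X_demo, y_demo, clf, col_of_1, p_class_1 and p_not_class_1 below are made up for illustration and do not come from the question:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data (not from the question): two separated groups.
X_demo = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
          [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5]]
y_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

clf = SVC(probability=True, random_state=0)
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)  # shape: (n_samples, n_classes)

# Look up the column for class 1 instead of assuming it is column 1.
col_of_1 = int(np.where(clf.classes_ == 1)[0][0])
p_class_1 = proba[:, col_of_1]

# Every row is a distribution over the classes, so it should sum to 1 ...
rows_sum_to_one = bool(np.allclose(proba.sum(axis=1), 1.0))
# ... and with two classes, P(not class 1) is simply the complement.
p_not_class_1 = 1.0 - p_class_1

print(clf.classes_, col_of_1, rows_sum_to_one)
```

Looking the column up by label is mostly a safeguard: with labels 0 and 1 the sorted order already puts class 1 in the second column, but the lookup keeps working if the labels are strings or other values.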
UPDATE
Now, in outputs such as the one you require, we don't put both probabilities; by convention, and for binary classification, we only include the probability of each instance belonging to class 1. Here is how we can do it for the toy dataset X shown above, which has only 2 instances:
pred = model.predict_proba(X)
pred
# array([[ 0.39097541, 0.60902459],
# [ 0.60705475, 0.39294525]])
import pandas as pd
# keep only the second column of pred, i.e. the probability for class 1
out = pd.DataFrame(pred[:, 1], columns=['y'])
print(out)
Result:
          y
0  0.609025
1  0.392945
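To get from such a DataFrame to the index,y layout asked for in the question, one option is to_csv with index_label. This is only a sketch: the probabilities below are placeholder values standing in for model.predict_proba(final_data_test)[:, 1], and the in-memory buffer is used just so the result can be inspected without touching the disk:

```python
import io

import numpy as np
import pandas as pd

# Placeholder probabilities standing in for predict_proba(...)[:, 1].
p_class_1 = np.array([0.45, 0.62, 0.43, 0.12, 0.55])

out = pd.DataFrame({"y": p_class_1})

# index_label names the index column, so the header reads "index,y".
buffer = io.StringIO()
out.to_csv(buffer, index_label="index")
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer, for example out.to_csv("predictions.csv", index_label="index") with a file name of your choice, writes the same content to disk.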
Upvotes: 2