Difference between predict() and predict_proba() functions in scikit learn

Question

Greetings, data science community! How's it going? I'm studying classification trees and scikit-learn, and during my studies I came across this "issue":

After creating a tree (clf = DecisionTreeClassifier()) and training it (clf.fit(X_train, y_train)), I decided to test its performance on the training data itself (just to compare later with the test data, in terms of sensitivity, specificity and ROC-AUC).
But instead of applying only predict(), I also applied predict_proba() to the X_train data.

As you can see in the image, observation 4 has a 50% probability of being zero and a 50% probability of being one (according to predict_proba()); however, predict() classified it as zero.

[Image: dataframe in which the first column is the result of predict_proba() and the second column is the result of predict()]

Did predict() pick ZERO by "chance", or, since the label is either zero or one, does it pick zero because it comes first (as if order matters)?

I could not find the answer in the documentation of these functions (source: https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d/sklearn/tree/_classes.py#L476)

Thanks in advance!

Maria K · Accepted Answer

In the case of binary classification, the returned class is computed as follows:

proba = self.tree_.predict(X)
...
return self.classes_.take(np.argmax(proba, axis=1), axis=0)

Reference: sklearn code

So basically the choice comes down to numpy.argmax.

The numpy documentation for argmax says the following:

Notes: In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.
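The tie-breaking rule quoted above can be checked directly. This is a minimal sketch; the proba array is made-up illustration data, not output from the asker's model.

```python
import numpy as np

# Hypothetical class probabilities: the first row is an exact 50/50 tie.
proba = np.array([[0.5, 0.5],
                  [0.2, 0.8]])

# argmax returns the index of the FIRST maximum in each row,
# so the tied row resolves to index 0.
idx = np.argmax(proba, axis=1)
print(idx)  # [0 1]
```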

So the final answer: when the probabilities are equal, predict() always returns the first class in clf.classes_. Since classes_ is sorted, with labels 0 and 1 that is 0, i.e. the negative label. It is not chosen by chance.
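To see this with an actual tree, here is a small sketch that reproduces the situation. The toy data is an assumption made for illustration: two identical feature rows with conflicting labels, which the tree cannot separate, so they land in a leaf with a 50/50 split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the two rows with x = 0.0 carry different labels,
# so they end up in the same leaf with class proportions 50/50.
X_train = np.array([[0.0], [0.0], [1.0], [1.0]])
y_train = np.array([0, 1, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_train)  # per-class probabilities for each row
pred = clf.predict(X_train)         # hard class labels

# Same rule as in the snippet quoted in this answer:
# take the class at the first maximum of each probability row.
manual = clf.classes_.take(np.argmax(proba, axis=1), axis=0)

print(clf.classes_)  # sorted class labels; index 0 is label 0
print(proba[0])      # tied probabilities for the first row
print(pred[0])       # the tie resolves to the first class, 0
```

Here pred and manual agree on every row. If you need a different behaviour on ties, you can threshold predict_proba() yourself instead of calling predict().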
