Reputation: 3
Greetings data science community! How's it going? So, I'm studying classification trees and scikit-learn, and during my studies I came across this "issue":
After creating a tree (clf = DecisionTreeClassifier()) and training it (clf.fit(X_train, y_train)), I decided to test its performance on the training data itself (just to compare it, later, with the test data in terms of sensitivity, specificity and ROC-AUC).
But instead of only applying predict(), I also applied predict_proba() on the X_train data.
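Roughly, this is what I did (a minimal sketch; the X_train / y_train values below are just random placeholders, not my real dataset):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(100, 4)            # placeholder features
y_train = rng.randint(0, 2, 100)      # placeholder binary labels 0/1

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_train)         # hard class labels (0 or 1)
y_proba = clf.predict_proba(X_train)  # per-class probabilities, shape (n_samples, 2)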
As you can see in the image, observation 4 has a 50% probability of being zero and a 50% probability of being one (according to the predict_proba() function); however, the predict() function classified it as zero.
Did the predict() function pick ZERO by "chance", or, since the classes are zero and one, does it pick zero because it comes first (as if order matters)?
I could not resolve my doubts by analyzing the documentation and source of the functions (source: https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d/sklearn/tree/_classes.py#L476)
Thanks in advance!
Upvotes: 0
Views: 221
Reputation: 2710
In the case of binary classification, the returned class is computed as follows:
proba = self.tree_.predict(X)
...
return self.classes_.take(np.argmax(proba, axis=1), axis=0)
Reference: sklearn code
So basically the choice falls to numpy.argmax.
Let's look at the numpy documentation, where we read the following:
Notes: In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.
So the final answer: in the case of equal probabilities, the first class is always chosen, which in binary classification (classes 0 and 1) corresponds to the negative label.
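Here is a quick sketch illustrating the tie-breaking (the probability row is made up to mimic observation 4 from the question):

import numpy as np

classes_ = np.array([0, 1])        # sorted class labels, as stored in clf.classes_
proba = np.array([[0.5, 0.5]])     # a tied row of class probabilities

idx = np.argmax(proba, axis=1)     # argmax returns the FIRST index on ties -> array([0])
pred = classes_.take(idx, axis=0)  # -> array([0]), i.e. class 0 is returned

print(idx, pred)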
Upvotes: 0