Vid Stropnik

Reputation: 302

SKlearn classifier's predict_proba doesn't sum to 1

I have a classifier (in this case, it is the sklearn.MLPClassifier), with which I'm trying to perform classification into one of 18 classes.

The class is thus multi-class, not multi-label. I'm trying to predict only a single class.

I have my training data X with X.shape = (103393, 300) and targets Y with Y.shape = (103393, 18), where each row of Y is a one-hot encoded vector denoting the target class.

EDIT in response to @Dr. Snoopy: I do not supply any labels -- I simply pass the 18-dimensional vector with a 1 at the correct class's index and 0 everywhere else (a one-hot encoded vector). To verify that the vectors are correctly one-hot encoded, I can run

import pandas as pd
pd.DataFrame(Y.sum(axis=1)).value_counts()

This returns 103393 counts of 1, i.e. every row contains exactly one 1. The vectors are correctly one-hot encoded, even upon examination.

When I fit the model, and return the class probability for all classes, the probability vector does not sum up to 1. Why might that be?

Here is an example of how I run the fitting:

from sklearn.neural_network import MLPClassifier

X_train, Y_train, X_test, Y_test = get_data()

model = MLPClassifier(max_iter=10000)
model.fit(X_train,Y_train)
probability_vector = model.predict_proba(X_test[0, :].reshape(1, -1))

Some of the time, the outputs sum to a value pretty close to 1, where I suspect the discrepancy is just floating-point rounding.

In other cases, the outputs sum to ~0.5 or less. Example output:

probability_vector = list(model.predict_proba(X_test[301,:].reshape(1,-1))[0])
print(probability_vector)
>>> [1.7591416e-06,
 3.148203e-05,
 3.9732524e-05,
 0.3810972,
 0.059248358,
 0.00032832936,
 8.5996935e-06,
 9.0914684e-05,
 9.377927e-07,
 0.0007674346,
 1.5543707e-06,
 0.0008467222,
 0.009655427,
 2.5728454e-05,
 1.07812774e-07,
 0.00022920035,
 0.00050288404,
 0.013878004]

len(probability_vector)

>>> 18

sum(probability_vector)
>>> 0.46675437349917814


Why might this be happening? Is my model initialized incorrectly?

Note: A couple of possible reasons for the error & my comments on them:

  • Class imbalance: The classes in the dataset are indeed imbalanced. However, the non-1 summation problem is happening for well-represented classes too, not just the underrepresented ones. Could this be a consequence of a model that is not expressive enough?

  • Model uncertainty: "The model may not have a high level of confidence in its predictions for every input. " Is that all it is?

Upvotes: 1

Views: 725

Answers (1)

Daraan

Reputation: 3947

Do not one-hot encode your labels Y. If your targets have multiple columns, the classifier will do multi-label classification.

Just pass the labels as they are; MLPClassifier will do the encoding for you using LabelBinarizer and then apply the softmax function correctly. You can find some more explanation in the docs.


You can check this, for example, by accessing model.out_activation_ or LabelBinarizer().fit(Y).y_type_. For multi-class classification these should be "softmax" and "multiclass" respectively, but with one-hot targets they will be "logistic" and "multilabel-indicator".

What you get at the moment are the independent logistic outputs of the individual classes, which have no reason to sum to 1.
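As a minimal sketch of the difference (using small synthetic data as a stand-in for the question's X and Y; the shapes and class count here are illustrative assumptions, not the original data):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 200 samples, 20 features, 5 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 5, size=200)   # integer class labels
Y_onehot = np.eye(5)[y]            # one-hot encoding of the same labels

# One-hot targets -> multi-label mode: independent per-class logistic outputs
multi_label = MLPClassifier(max_iter=300, random_state=0).fit(X, Y_onehot)
print(multi_label.out_activation_)              # 'logistic'
print(multi_label.predict_proba(X[:1]).sum())   # generally not 1

# Integer targets -> multi-class mode: softmax outputs that sum to 1
multi_class = MLPClassifier(max_iter=300, random_state=0).fit(X, y)
print(multi_class.out_activation_)              # 'softmax'
print(multi_class.predict_proba(X[:1]).sum())   # 1.0
```

So for the question's data, passing Y.argmax(axis=1) instead of the one-hot matrix Y would put the classifier into multi-class mode.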

Upvotes: 4
