Mathews24

Reputation: 751

Encoding labels for multi-class problems in scikit-learn

When utilizing classifiers from scikit-learn for multi-class problems, is it necessary to encode the labels with one hot encoding? For example, I have 3 classes and simply labeled them as 0, 1, and 2 when feeding this data into the different classifiers for training. As far as I can tell, it seems to be working normally. But is there any reason this kind of basic encoding is not recommended?

Some algorithms, like random forests, handle categorical values natively. If I'm not mistaken, logistic regression, multilayer perceptron, Gaussian naive Bayes, and random forest all appear to do so. Is that assessment correct? Which of scikit-learn's classifiers do not handle these inputs natively and are influenced by ordinality?

Upvotes: 3

Views: 6701

Answers (1)

Vivek Kumar

Reputation: 36619

All scikit-learn classifiers handle multi-class problems automatically.

Internally, the labels are converted appropriately: either to a simple integer encoding (0, 1, 2, etc.) if the algorithm supports multi-class problems natively, or to a one-hot (binary indicator) encoding if the algorithm handles multi-class problems by reducing them to binary problems.
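As a rough illustration of what these two encodings look like (not a claim about what any particular estimator does internally), here is a small sketch using `LabelEncoder` and `LabelBinarizer` from `sklearn.preprocessing`. The label values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Made-up string labels, purely for illustration.
y = np.array(["cat", "dog", "bird", "dog"])

# Integer encoding: each label becomes its index in the sorted class list.
int_encoder = LabelEncoder()
y_int = int_encoder.fit_transform(y)

# Binary indicator (one-hot) encoding: one 0/1 column per class.
binarizer = LabelBinarizer()
y_onehot = binarizer.fit_transform(y)

print(int_encoder.classes_)  # ['bird' 'cat' 'dog']
print(y_int)                 # [1 2 0 2]
print(y_onehot)              # one row per sample, one column per class
```

You normally don't need to call either of these yourself for a plain multi-class target; the sketch is only meant to show the difference between the two representations.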

Please refer to the documentation to see this:

All scikit-learn classifiers are capable of multiclass classification,...

You can see that "logistic regression, multilayer perceptron, Gaussian naive Bayes, and random forest" are under the heading "Inherently multiclass".

Others, like SGD or LinearSVC, use a one-vs-rest approach to handle multi-class problems. As I said above, scikit-learn handles this internally, so you as a user don't need to do anything: you can pass multi-class labels (even as strings) in a single array y to any classification estimator.
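A minimal sketch of passing string labels directly, assuming the bundled iris dataset as stand-in data. The choice of classifiers and the `max_iter` / `random_state` values are my own, picked only to keep the example quiet and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
X = iris.data
# Turn the integer targets into string labels such as 'setosa'.
y_str = iris.target_names[iris.target]

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVC (one-vs-rest)": LinearSVC(max_iter=10000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

for name, clf in classifiers.items():
    # A single 1-D array of string labels is passed as y; no manual encoding.
    clf.fit(X, y_str)
    print(name, clf.classes_, clf.predict(X[:2]))
```

Each fitted classifier exposes the original labels in `classes_`, and `predict` returns those same string labels rather than integers.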

The only case where the user needs to explicitly convert labels to a one-hot-style (binary indicator) encoding is the multi-label problem, where more than one label can be predicted for a sample. But I think your question is not about that.
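For completeness, a small sketch of the multi-label conversion using `MultiLabelBinarizer`, which is one way to build that binary indicator matrix. The tag names below are invented for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample can carry several labels at once (invented tags).
y_multilabel = [["news", "sports"], ["tech"], ["news", "tech"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_multilabel)

print(mlb.classes_)  # ['news' 'sports' 'tech']
print(Y)             # rows: [1 1 0], [0 0 1], [1 0 1]
```

The resulting 2-D array `Y` has one column per label and is the form of target expected for a multi-label problem.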

Upvotes: 8
