Reputation: 368
I have a classification problem with multiple classes, let's call them A, B, C and D. My data has the following shape:
X=[#samples, #features, 1], y=[#samples,1].
To be more specific, the y looks like this:
[['A'], ['B'], ['D'], ['A'], ['C'], ...]
When I train a Random Forest classifier on these labels, this works fine; however, I have read multiple times that class labels also need to be one-hot encoded. After one-hot encoding, y is
[[1,0,0,0], [0,1,0,0], ...]
and has the shape
[#samples, 4]
The problem arises when I try to train on this encoded y. The model predicts each of the four labels individually, meaning that it can also produce an output like [0 0 0 0], which I don't want. rfc.classes_
returns
# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]
How would I tell the model that the labels are one hot encoded instead of multiple labels which shall be predicted independently of each other? Do I need to change my y or do I need to alter some settings of the model?
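For reference, the multilabel behaviour described above can be reproduced with a small sketch (toy data; variable names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 3))
labels = rng.choice(list("ABCD"), 100)

# One-hot encode the string labels into shape (100, 4)
classes = np.array(["A", "B", "C", "D"])
y_onehot = (labels[:, None] == classes).astype(int)

rfc = RandomForestClassifier(random_state=0).fit(X, y_onehot)
# With a 2D y, sklearn treats each column as an independent binary target,
# so classes_ becomes a list of four array([0, 1]) entries
print(rfc.classes_)
```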
Upvotes: 3
Views: 14301
Reputation: 994
As far as I know, you do not need to encode the labels here; you can keep them as they are, whether strings or numbers. One hot encoding / label encoding should be considered when using a neural network. For example, with the BBC classification data:
model.predict(sample_data)
array(['entertainment'], dtype='<U13')
One hot encoding is, however, necessary for categorical text features in the training set, since the model needs numeric input. For example:

name       fuel type
baleno     petrol
MG hector  electric

after one hot encoding:

name       fuel type_petrol  fuel type_electric
baleno     1                 0
MG hector  0                 1
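The feature encoding above can be done with pandas; a minimal sketch (the column names are taken from the example):

```python
import pandas as pd

df = pd.DataFrame({"name": ["baleno", "MG hector"],
                   "fuel type": ["petrol", "electric"]})

# get_dummies expands each categorical column into one indicator column
# per category, named "<column>_<category>"
encoded = pd.get_dummies(df, columns=["fuel type"])
print(encoded.columns.tolist())
# ['name', 'fuel type_electric', 'fuel type_petrol']
```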
Upvotes: 0
Reputation: 3729
You don't have to use one hot encoding when using random forest in sklearn. What you need is a label encoder, and your y should look like:
from sklearn.preprocessing import LabelEncoder
y = ["A","B","D","A","C"]
le = LabelEncoder()
le.fit_transform(y)
# array([0, 1, 3, 0, 2], dtype=int64)
I modified the sample code sklearn provides:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
# replace the numeric y with string labels
y = np.random.choice(["A", "B", "C", "D"], 1000)
print(y.shape)  # (1000,)

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
clf.classes_
# array(['A', 'B', 'C', 'D'], dtype='<U1')
Whether you process y with label encoding or not, both work with RandomForestClassifier.
Upvotes: 3
Reputation: 2748
Your original approach, without one-hot encoding, was doing what you wanted.
One-hot encoding is meant for inputs to many models, but for outputs only in a few cases (e.g. training a neural network with cross-entropy loss). So it is only needed for some algorithm implementations, while others do fine without it.
For output labels, a classifier like RandomForest is just fine with strings and multiple classes.
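If the labels have already been one-hot encoded, they can be converted back to a single column before fitting; a sketch, assuming the column order of the encoding is known:

```python
import numpy as np

# assumed column order of the one-hot encoding
classes = np.array(["A", "B", "C", "D"])
y_onehot = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1]])

# argmax along axis 1 recovers the class index of each row
y = classes[y_onehot.argmax(axis=1)]
print(y)  # ['A' 'B' 'D']
```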
Upvotes: 4