Reputation: 536
I have a categorical dataset like this:
Name | Color (Category) |
---|---|
Car | Red |
Grass | Green |
Sky | Blue |
Apple | Red,Green |
Photo | Black,White |
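So one row can have one or several categories.
I'm also using OneHotEncoder to encode the categories:
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
data = asarray([['red'], ['green'], ['blue']])
print(data)
# sparse_output was named `sparse` before scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
onehot = encoder.fit_transform(data)
print(onehot)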
The output will be
[['red']
['green']
['blue']]
[[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]]
Is it possible to use one-hot encoding for two or more categories at once, e.g. convert Red,Green to [0. 1. 1.]?
I've read the OneHotEncoder and tensorflow.keras.utils.to_categorical documentation and can't find such a solution.
I came across a suggestion somewhere that I should change my logic to this:
Name | Color (Category) (Remove) | Red | Green | Blue | Black | White |
---|---|---|---|---|---|---|
Car | //Red | True | False | False | False | False |
Grass | //Green | False | True | False | False | False |
Sky | //Blue | False | False | True | False | False |
Apple | //Red,Green | True | True | False | False | False |
Photo | //Black,White | False | False | False | True | True |
So just ignore the Color column and give the Sequential model several outputs.
But isn't that exactly the same as converting Red,Green to [0. 1. 1.]?
I feel like I'm missing something obvious; sorry if my question is dumb.
Upvotes: 1
Views: 1359
Reputation: 111
The name "one-hot encoding" comes from having a single high bit while all the others are low. So if you encode a value as 011, it is no longer one-hot encoding. There is a binary encoding approach, but if you assign the codes arbitrarily, like this:
001 blue
010 red
011 green
100 blue red
then it is problematic for this scenario, because red and green share the second bit and will affect each other throughout the whole training process.
So, to make shared bits contribute to learning instead, the table you showed is a logical approach to this problem: each color gets its own bit, as below:
001 blue
010 red
100 green
110 green red
111 green blue red
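A minimal sketch of the bit-per-color scheme listed above, in plain Python. The `COLORS` order and the `multi_hot` helper are hypothetical names chosen only to reproduce the codes in this answer; they are not part of any library.

```python
# Hypothetical bit order, chosen to match the codes above:
# leftmost bit = green, middle bit = red, rightmost bit = blue.
COLORS = ["green", "red", "blue"]

def multi_hot(labels):
    """Return a bit string with one dedicated bit per color."""
    present = set(labels)
    return "".join("1" if color in present else "0" for color in COLORS)

print(multi_hot(["blue"]))                  # 001
print(multi_hot(["red"]))                   # 010
print(multi_hot(["green", "red"]))          # 110
print(multi_hot(["green", "blue", "red"]))  # 111
```

Because every color owns exactly one position, a bit that two rows share always means the same color, unlike the arbitrary binary codes.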
Upvotes: 1
Reputation: 1564
In your case, it is better to use MultiLabelBinarizer from sklearn. If df is a dataframe holding your dataset, you can do something like this:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# split "Red,Green" into ["Red", "Green"] so each row is a list of labels
mlb.fit_transform(df["Color (Category)"].str.split(","))
Link to sklearn documentation : https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
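For completeness, here is a self-contained sketch that applies this to the table from the question. The dataframe is rebuilt by hand here, which is an assumption about how the data is stored; the column names are taken from the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# The example table from the question, rebuilt by hand for illustration.
df = pd.DataFrame({
    "Name": ["Car", "Grass", "Sky", "Apple", "Photo"],
    "Color (Category)": ["Red", "Green", "Blue", "Red,Green", "Black,White"],
})

mlb = MultiLabelBinarizer()
# Each cell becomes a list of labels, e.g. "Red,Green" -> ["Red", "Green"].
encoded = mlb.fit_transform(df["Color (Category)"].str.split(","))

# classes_ holds the column order of the encoded matrix (sorted labels).
print(mlb.classes_)
print(encoded)

# Optionally attach the indicator columns back to the names.
result = pd.concat(
    [df[["Name"]], pd.DataFrame(encoded, columns=mlb.classes_)], axis=1
)
print(result)
```

The resulting matrix has one 0/1 column per color, so Apple gets a 1 in both the Green and Red columns. This is the same layout as the True/False table in the question.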
Upvotes: 2