Levsha

Reputation: 536

Use one-hot encoding for several categories in one row

I have a categorical dataset like this:

Name    Color (Category)
Car     Red
Grass   Green
Sky     Blue
Apple   Red,Green
Photo   Black,White

So one row can have one or several categories.

I'm also using OneHotEncoder for the categories:

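from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])
print(data)
# sparse=False returns a dense array; scikit-learn 1.2+ names this parameter sparse_output
encoder = OneHotEncoder(sparse=False)
onehot = encoder.fit_transform(data)
print(onehot)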

The output will be

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

Is it possible to use one-hot encoding for a row with two or more categories?

For example, can Red,Green be converted to [0. 1. 1.]?

I have read the documentation for OneHotEncoder and tensorflow.keras.utils.to_categorical and can't find such a solution.

Another solution

I came across the opinion somewhere that I need to change my logic to this:

Name    Color (Category) (Remove)   Red     Green   Blue    Black   White
Car     //Red                       True    False   False   False   False
Grass   //Green                     False   True    False   False   False
Sky     //Blue                      False   False   True    False   False
Apple   //Red,Green                 True    True    False   False   False
Photo   //Black,White               False   False   False   True    True

So I would just ignore the Color column and give the Sequential model several outputs.

But isn't that exactly the same as converting Red,Green to [0. 1. 1.]?

I feel that I'm missing something obvious; sorry if my question is dumb.

Upvotes: 1

Views: 1359

Answers (2)

benan.akca

Reputation: 111

The name "one-hot encoding" comes from having a single high bit while all the other bits are low. So if you encode a row as 011, it is not one-hot encoding anymore. There is also a binary encoding approach. Suppose, however, that you assign the codes arbitrarily, like this:

    001 blue
    010 red
    011 green
    100 blue red

This is problematic for your scenario, because red and green share the second bit and will affect each other throughout the whole training process.

To make shared bits contribute to learning instead, each color should get its own bit, so that a shared bit always means a shared color. The table you showed is therefore a logical approach to this problem, as below:

    001 blue
    010 red
    100 green
    110 green red
    111 green blue red
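
The scheme above can be sketched in plain Python: each color gets its own bit position, and a row with several colors is the element-wise OR of the one-hot vectors of its colors. The helper names one_hot and multi_hot are made up for illustration, and the bit order (green, red, blue) is simply the one used in the table above.

```python
# Bit order follows the table above: green, red, blue (left to right).
# COLORS, one_hot and multi_hot are illustrative names, not library functions.
COLORS = ["green", "red", "blue"]

def one_hot(color):
    """Return a vector with a single 1 at the position of `color`."""
    return [1 if c == color else 0 for c in COLORS]

def multi_hot(colors):
    """Element-wise OR of the one-hot vectors of all given colors."""
    vector = [0] * len(COLORS)
    for color in colors:
        vector = [a | b for a, b in zip(vector, one_hot(color))]
    return vector

print(multi_hot(["red"]))                   # [0, 1, 0]
print(multi_hot(["green", "red"]))          # [1, 1, 0]
print(multi_hot(["green", "blue", "red"]))  # [1, 1, 1]
```

Because every color owns exactly one position, two rows share a high bit only when they really share a color.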

Upvotes: 1

Pierre-Loic

Reputation: 1564

In your case, it is better to use MultiLabelBinarizer from sklearn. If df is a DataFrame holding your dataset, you can do something like this:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit_transform(df["Color (Category)"].str.split(","))

Link to the sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
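
For completeness, here is a self-contained sketch of the same idea. It assumes the table from the question is loaded into a pandas DataFrame named df with the column "Color (Category)". The DataFrame is built by hand here purely for illustration, and the variable names encoded and result are my own.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: the question's table, typed in by hand for this sketch.
df = pd.DataFrame({
    "Name": ["Car", "Grass", "Sky", "Apple", "Photo"],
    "Color (Category)": ["Red", "Green", "Blue", "Red,Green", "Black,White"],
})

mlb = MultiLabelBinarizer()
# Split "Red,Green" into ["Red", "Green"] so each row is a list of labels.
encoded = mlb.fit_transform(df["Color (Category)"].str.split(","))

# classes_ gives the column order of the encoded array (labels are sorted).
print(mlb.classes_)
print(encoded)

# Optionally attach the indicator columns back to the names.
result = pd.concat(
    [df[["Name"]], pd.DataFrame(encoded, columns=mlb.classes_)], axis=1
)
print(result)
```

Each row of encoded has a 1 for every color that row contains, so a row such as Apple gets two 1s. This is the same layout as the True/False table in the question.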

Upvotes: 2
