Use one-hot encoding for several categories in one row

Question

I have a categorical dataset like this:

Name	Color (Category)
Car	Red
Grass	Green
Sky	Blue
Apple	Red,Green
Photo	Black,White

So one row can have one or several categories.

I'm currently using OneHotEncoder for the categories:

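from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

data = asarray([['red'], ['green'], ['blue']])
print(data)
# sparse_output replaced the sparse argument in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
onehot = encoder.fit_transform(data)
print(onehot)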

The output will be

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

Is it possible to use one-hot encoding when a row has two or more categories?

For example, can Red,Green be converted to [0. 1. 1.]?

I've read the OneHotEncoder and tensorflow.keras.utils.to_categorical documentation and can't find a solution for this.

Another solution

I came across a suggestion somewhere that I should restructure my data like this:

Name	Color (Category) (Remove)	Red	Green	Blue	Black	White
Car	//Red	True	False	False	False	False
Grass	//Green	False	True	False	False	False
Sky	//Blue	False	False	True	False	False
Apple	//Red,Green	True	True	False	False	False
Photo	//Black,White	False	False	False	True	True

So I would just drop the Color column and give the Sequential model several outputs, one per color.

But isn't that exactly the same as converting Red,Green to [0. 1. 1.]?

I feel like I'm missing something obvious; sorry if my question is dumb.

Pierre-Loic · Accepted Answer

In your case, it is better to use MultiLabelBinarizer from sklearn. If df is a DataFrame holding your dataset, you can do something like this:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit_transform(df["Color (Category)"].str.split(","))

Link to the sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
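
To make the result concrete, here is a minimal sketch run on the table from the question. It assumes pandas and scikit-learn are installed; the DataFrame is rebuilt by hand from the question's rows, and the names encoded and multi_hot are only illustrative. Note that MultiLabelBinarizer orders its columns by the sorted class labels, so the column order differs from the table in the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data copied from the question's table.
df = pd.DataFrame({
    "Name": ["Car", "Grass", "Sky", "Apple", "Photo"],
    "Color (Category)": ["Red", "Green", "Blue", "Red,Green", "Black,White"],
})

mlb = MultiLabelBinarizer()
# Split "Red,Green" into ["Red", "Green"] so each row is a list of labels.
encoded = mlb.fit_transform(df["Color (Category)"].str.split(","))

# One indicator column per color, in the sorted order stored in mlb.classes_.
multi_hot = pd.DataFrame(encoded, columns=mlb.classes_, index=df["Name"])
print(multi_hot)
```

Each row of multi_hot has a 1 in every column whose color appears in that row, which is the same information as the True/False columns sketched in the question, so the two approaches should be equivalent as model targets.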
