Reputation: 3208
With one-hot encoding, when a column holds a single value per row, say a color, pandas get_dummies
works as follows:
df = pd.DataFrame({'f1': ['red', 'yellow']})
df
Out[24]:
f1
0 red
1 yellow
pd.get_dummies(df)
Out[25]:
f1_red f1_yellow
0 1 0
1 0 1
A "many-hot encoding" problem arises when a cell may hold a list of colors, as in the following example:
df = pd.DataFrame({'f1': ['red', ['yellow', 'blue']]})
df
Out[27]:
f1
0 red
1 [yellow, blue]
Is there a graceful, Pythonic way, ideally supported by pandas, to produce the following result?
f1_red f1_yellow f1_blue
0 1 0 0
1 0 1 1
Upvotes: 2
Views: 256
Reputation: 862751
You can join the lists with | and then use str.get_dummies:
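# Join each list into a single '|'-separated string; leave plain strings as they are
s = df['f1'].apply(lambda x: '|'.join(x) if isinstance(x, list) else x)
# Store the result under a new name so the original df stays available
result = s.str.get_dummies()
print(result)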
blue red yellow
0 0 1 0
1 1 0 1
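The output above does not carry the f1_ prefix the question asked for, and its columns come out in alphabetical order. A minimal sketch of one way to get closer to the requested layout, assuming the column is named f1 as in the question (the name dummies is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'f1': ['red', ['yellow', 'blue']]})

# Join lists into '|'-separated strings; '|' is the default separator of str.get_dummies
s = df['f1'].apply(lambda x: '|'.join(x) if isinstance(x, list) else x)

# add_prefix prepends the column name to every generated indicator column
dummies = s.str.get_dummies().add_prefix('f1_')

# Reorder the columns to match the layout shown in the question
dummies = dummies[['f1_red', 'f1_yellow', 'f1_blue']]
print(dummies)
```

Because the values are split on |, this approach would likely misbehave if a label itself contained that character.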
Another solution, if performance is important:
from sklearn.preprocessing import MultiLabelBinarizer

# Wrap scalar values in a list so that every row is a list of labels
s = df['f1'].apply(lambda x: x if isinstance(x, list) else [x])
mlb = MultiLabelBinarizer()
result = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_)
print(result)
blue red yellow
0 0 1 0
1 1 0 1
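One caveat with building the frame from the raw array is that the original index is replaced by a default 0..n-1 range. A minimal sketch, assuming a non-default index ('a', 'b', chosen here only for illustration), that keeps the index and adds the f1_ prefix:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical index labels, used only to show that the index is preserved
df = pd.DataFrame({'f1': ['red', ['yellow', 'blue']]}, index=['a', 'b'])

# Wrap scalar values in a list so that every row is a list of labels
s = df['f1'].apply(lambda x: x if isinstance(x, list) else [x])

mlb = MultiLabelBinarizer()
# Passing index=s.index keeps the rows aligned with the original frame
encoded = pd.DataFrame(
    mlb.fit_transform(s), columns=mlb.classes_, index=s.index
).add_prefix('f1_')
print(encoded)
```

With the index preserved, the encoded columns can be joined back onto the original frame, for example with df.join(encoded).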
Upvotes: 2