Many-Hot (N-hot) encoding - quick pandas approach?

Question

With one-hot encoding, when a column holds a single value per row, say a color, pandas get_dummies works as follows:

df = pd.DataFrame({'f1': ['red', 'yellow']})
df
Out[24]: 
       f1
0     red
1  yellow

pd.get_dummies(df)
Out[25]: 
   f1_red  f1_yellow
0       1          0
1       0          1

A "many-hot encoding" problem arises when a row can hold a list of colors, as in the following example:

df = pd.DataFrame({'f1': ['red', ['yellow', 'blue']]})
df
Out[27]: 
               f1
0             red
1  [yellow, blue]

Is there a graceful, Pythonic way, ideally supported by pandas, to get the following result?

   f1_red  f1_yellow  f1_blue
0       1          0        0
1       0          1        1

jezrael · Accepted Answer

You can join each list into a single string with | as the separator and then use str.get_dummies, which splits on | by default:

# Join list values into one '|'-separated string; plain strings are left as they are
s = df['f1'].apply(lambda x: '|'.join(x) if isinstance(x, list) else x)

df = s.str.get_dummies()
print(df)

   blue  red  yellow
0     0    1       0
1     1    0       1
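
The output above lacks the f1_ prefix that the question asked for. A minimal sketch of one way to restore it, assuming the same sample frame as the question; the variable name encoded is only illustrative. Note that this approach assumes no label contains the | character itself.

```python
import pandas as pd

df = pd.DataFrame({'f1': ['red', ['yellow', 'blue']]})

# Join list entries with '|' (the default separator of str.get_dummies);
# plain strings pass through unchanged.
s = df['f1'].apply(lambda x: '|'.join(x) if isinstance(x, list) else x)

# add_prefix puts the column name in front of every label,
# matching the naming that pd.get_dummies(df) produces.
encoded = s.str.get_dummies().add_prefix('f1_')
print(encoded)
```

The columns come out in sorted label order (f1_blue, f1_red, f1_yellow); reindex them, for example with encoded[['f1_red', 'f1_yellow', 'f1_blue']], if a specific order matters.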

Another solution, if performance is important, is MultiLabelBinarizer from scikit-learn:

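from sklearn.preprocessing import MultiLabelBinarizer

# Wrap scalar values in a list so that every row is a list of labels
s = df['f1'].apply(lambda x: x if isinstance(x, list) else [x])

mlb = MultiLabelBinarizer()
# index=s.index keeps the original row labels
df = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=s.index)
print(df)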
   blue  red  yellow
0     0    1       0
1     1    0       1
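
If adding scikit-learn as a dependency is not an option, a pandas-only variant is possible with Series.explode (pandas 0.25 or later). This is a different technique from the two shown above, sketched here on the question's sample frame; the names exploded and encoded are only illustrative, and no claim is made about how its speed compares.

```python
import pandas as pd

df = pd.DataFrame({'f1': ['red', ['yellow', 'blue']]})

# explode() gives every list element its own row and repeats the
# original index for it; scalar values are returned unchanged.
exploded = df['f1'].explode()

# One-hot encode each element, then collapse the rows that share an
# original index: max() marks a label as present if any element had it.
encoded = (
    pd.get_dummies(exploded, prefix='f1')
      .groupby(level=0)
      .max()
      .astype(int)
)
print(encoded)
```

Because get_dummies is called with prefix='f1', the columns already carry the f1_ prefix requested in the question.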
