Bernardo

Reputation: 86

One-hot encoding for words which occur in multiple columns

I want to create one-hot encoded data from the categorical data shown here:

        Label1          Label2        Label3  
0   Street fashion        Clothing       Fashion
1         Clothing       Outerwear         Jeans
2     Architecture        Property      Clothing
3         Clothing           Black      Footwear
4            White      Photograph        Beauty

The problem (for me) is that one specific label (e.g. Clothing) can appear in Label1, Label2 or Label3. I tried pd.get_dummies, but this created a separate dummy column for each source column, like:

   Label1_Clothing  Label2_Clothing  Label3_Clothing
0                0                1                0
1                1                0                0
2                0                0                1

Is there a way to get only one dummy variable column per label, regardless of which column it came from? Something like this instead:

   Label_Clothing  Label_Street fashion  Label_Architecture
0               1                     1                   0
1               1                     0                   0
2               1                     0                   1

I am pretty new to programming and would be very glad for your help.

Best, Bernardo

Upvotes: 1

Views: 177

Answers (1)

Cameron Riddell

Reputation: 13407

You can stack your dataframe into a single Series and then get the dummies from that. The stacked Series has a two-level index (original row, original column), so taking the maximum over the outer level collapses the result back to one row per original row, with a 1 wherever a label appeared in any of the columns:

# group by the original row index (level 0) and take the max per label
dummies = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).max()

print(dummies)
   Architecture  Beauty  Black  Clothing  Fashion  Footwear  Jeans  Outerwear  Photograph  Property  Street fashion  White
0             0       0      0         1        1         0      0          0           0         0               1      0
1             0       0      0         1        0         0      1          1           0         0               0      0
2             1       0      0         1        0         0      0          0           0         1               0      0
3             0       0      1         1        0         1      0          0           0         0               0      0
4             0       1      0         0        0         0      0          0           1         0               0      1
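
For anyone who wants to run this end to end, here is a minimal sketch. It assumes the frame from the question is rebuilt by hand as `df`. The `Label_` prefix added with `add_prefix` is an optional extra to match the column names asked for in the question, not something the approach needs.

```python
import pandas as pd

# Rebuild the example frame from the question (typed in by hand).
df = pd.DataFrame({
    "Label1": ["Street fashion", "Clothing", "Architecture", "Clothing", "White"],
    "Label2": ["Clothing", "Outerwear", "Property", "Black", "Photograph"],
    "Label3": ["Fashion", "Jeans", "Clothing", "Footwear", "Beauty"],
})

# stack() turns the frame into one long Series indexed by (row, column).
stacked = df.stack()

# One indicator column per distinct label, then collapse back to one row
# per original row by taking the max within each level-0 group.
dummies = pd.get_dummies(stacked, dtype=int).groupby(level=0).max()

# Optional: prefix the columns, e.g. "Clothing" -> "Label_Clothing".
dummies = dummies.add_prefix("Label_")

print(dummies[["Label_Clothing", "Label_Street fashion", "Label_Architecture"]])
```

Taking the max (rather than the sum) keeps the result a 0/1 indicator even if the same label shows up in more than one column of a row; a sum would give a count instead.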

Upvotes: 2
