Lucy

Reputation: 401

Input data creation for Multilabel classification

I am working on a multilabel classification problem. Every value in X is categorical. The original data is below:

ID  X1  X2  X3  Y
111 AA  LL  KK  MMM
111 AA  LL  KK  MMM
111 BB  LL  jj  NNN
121 HH  DD  uu  III
121 HH  DD  yy  OOO
121 HH  LL  aa  PPP

I am trying to convert this to a dataframe in which every unique value in the columns (X1, X2, X3, Y) becomes a new column and every ID has a single record. The expected output is:

ID   X1_AA  X1_BB  X1_HH  X2_LL  X2_DD  X3_KK  X3_jj  X3_uu  X3_yy  X3_aa  Y_MMM  Y_NNN  Y_III  Y_OOO  Y_PPP
111  1      1      0      1      0      1      1      0      0      0      1      1      0      0      0
121  0      0      1      1      1      0      0      1      1      1      0      0      1      1      1

I tried pandas get_dummies. It creates the dummy columns, but the IDs are duplicated. Y is my target column; multiple values of Y for an ID mean that the ID has accessed multiple channels.

Also, please suggest whether I can use the original data directly in classification by creating dummy columns for X and Y.

Upvotes: 3

Views: 632

Answers (2)

Hryhorii Pavlenko

Reputation: 3910

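# One-hot encode the categorical columns, then sum the dummies per ID
new_df = pd.get_dummies(df).groupby('ID').sum()
# Cap counts above 1 so every cell is a 0/1 indicator
new_df[new_df > 1] = 1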

ID   X1_AA  X1_BB  X1_HH  X2_DD  X2_LL  X3_KK  X3_aa  X3_jj  X3_uu  X3_yy  Y_III  Y_MMM  Y_NNN  Y_OOO  Y_PPP
111  1      1      0      0      1      1      0      1      0      0      0      1      1      0      0
121  0      0      1      1      1      0      1      0      1      1      1      0      0      1      1
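As a small check of why the cap is needed, the sketch below uses a cut-down frame (only ID 111 and column X1, an illustrative subset of the question's data). Summing counts repeated values, so X1_AA comes out as 2 before capping. `clip(upper=1)` is used here as an assumed shorthand for the boolean-mask assignment, and `dtype=int` is passed because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Illustrative subset of the question's data: ID 111, column X1 only.
df = pd.DataFrame({
    "ID": [111, 111, 111],
    "X1": ["AA", "AA", "BB"],
})

# dtype=int forces 0/1 integers instead of booleans.
dummies = pd.get_dummies(df, columns=["X1"], dtype=int)

# Summing per ID counts occurrences, so "AA" (seen twice) becomes 2.
counts = dummies.groupby("ID").sum()

# Capping at 1 turns the counts back into presence indicators.
capped = counts.clip(upper=1)

# Taking the per-ID maximum gives the same indicators in one step.
via_max = dummies.groupby("ID").max()

print(counts)
print(capped)
```

On this subset the capped sums and the per-ID maximum agree, which is why the two approaches produce the same table.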

Edit: I wasn't aware of the .max() method in groupby. @jezrael's answer is definitely the better one.

Upvotes: 1

jezrael

Reputation: 862441

To get one row of dummies per ID, aggregate with max:

# One-hot encode, then keep the maximum (0 or 1) of each dummy per ID
df1 = pd.get_dummies(df).groupby('ID', as_index=False).max()
print(df1)
    ID  X1_AA  X1_BB  X1_HH  X2_DD  X2_LL  X3_KK  X3_aa  X3_jj  X3_uu  X3_yy  \
0  111      1      1      0      0      1      1      0      1      0      0   
1  121      0      0      1      1      1      0      1      0      1      1   

   Y_III  Y_MMM  Y_NNN  Y_OOO  Y_PPP  
0      0      1      1      0      0  
1      1      0      0      1      1  
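For anyone who wants to reproduce this end to end, here is a self-contained sketch that rebuilds the sample frame from the question and then separates the result into a feature matrix and a multilabel target. The names `label_cols`, `features` and `labels` are hypothetical, and splitting on the `Y_` prefix is an assumption based on the question's statement that Y is the target. `dtype=int` is passed because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 121, 121, 121],
    "X1": ["AA", "AA", "BB", "HH", "HH", "HH"],
    "X2": ["LL", "LL", "LL", "DD", "DD", "LL"],
    "X3": ["KK", "KK", "jj", "uu", "yy", "aa"],
    "Y": ["MMM", "MMM", "NNN", "III", "OOO", "PPP"],
})

# One-hot encode every categorical column; ID is left untouched.
dummies = pd.get_dummies(df, columns=["X1", "X2", "X3", "Y"], dtype=int)

# Collapse to one row per ID: a dummy is 1 if any row of that ID had the value.
df1 = dummies.groupby("ID", as_index=False).max()

# Hypothetical split: columns starting with "Y_" form the multilabel target.
label_cols = [c for c in df1.columns if c.startswith("Y_")]
features = df1.drop(columns=["ID"] + label_cols)
labels = df1[label_cols]

print(features)
print(labels)
```

Each row of `labels` can then hold several 1s, which matches the multilabel setup described in the question (one ID accessing multiple channels).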

Upvotes: 2
