Reputation: 401
I am working on a multilabel classification problem. Every value in X is categorical. The original data is below:
ID X1 X2 X3 Y
111 AA LL KK MMM
111 AA LL KK MMM
111 BB LL jj NNN
121 HH DD uu III
121 HH DD yy OOO
121 HH LL aa PPP
I am trying to convert this to a dataframe where every unique value present in the columns (X1, X2, X3, Y) becomes a new column and every ID has a single record. The expected output is:
ID X1_AA X1_BB X1_HH X2_LL X2_DD X3_KK X3_jj X3_uu X3_yy X3_aa Y_MMM Y_NNN Y_III Y_OOO Y_PPP
111 1 1 0 1 0 1 1 0 0 0 1 1 0 0 0
121 0 0 1 1 1 0 0 1 1 1 0 0 1 1 1
I tried using pandas get_dummies. It creates the dummy columns, but the IDs are duplicated. Here Y is my target column. Multiple values of Y for an ID mean that the ID has accessed multiple channels.
Also, please suggest whether I can use the original data directly for classification by creating dummy columns for X and Y.
Upvotes: 3
Views: 632
Reputation: 3910
new_df = pd.get_dummies(df).groupby('ID').sum()
new_df[new_df > 1] = 1
ID X1_AA X1_BB X1_HH X2_DD X2_LL X3_KK X3_aa X3_jj X3_uu X3_yy Y_III Y_MMM Y_NNN Y_OOO Y_PPP
111 1 1 0 0 1 1 0 1 0 0 0 1 1 0 0
121 0 0 1 1 1 0 1 0 1 1 1 0 0 1 1
Edit: I wasn't aware of the .max() method in groupby. @jezrael's answer is definitely a better one.
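A quick way to check that summing and capping at 1 gives the same result as taking the group maximum is sketched below. The DataFrame is rebuilt from the sample rows in the question. The `clip(upper=1)` call stands in for the boolean-mask assignment, and the cast to `int64` is an assumption added only so the comparison does not depend on the dummy dtype of the installed pandas version.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 121, 121, 121],
    "X1": ["AA", "AA", "BB", "HH", "HH", "HH"],
    "X2": ["LL", "LL", "LL", "DD", "DD", "LL"],
    "X3": ["KK", "KK", "jj", "uu", "yy", "aa"],
    "Y": ["MMM", "MMM", "NNN", "III", "OOO", "PPP"],
})

dummies = pd.get_dummies(df)  # ID is numeric, so it is left as-is

# Approach 1: count occurrences per ID, then cap every count at 1.
by_sum = dummies.groupby("ID").sum().clip(upper=1).astype("int64")

# Approach 2: take the maximum per ID (1 if the value occurs at all).
by_max = dummies.groupby("ID").max().astype("int64")

print(by_sum.equals(by_max))
```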
Upvotes: 1
Reputation: 862441
To get dummy indicators with one row per ID, aggregate with max:
df1 = pd.get_dummies(df).groupby('ID', as_index=False).max()
print(df1)
ID X1_AA X1_BB X1_HH X2_DD X2_LL X3_KK X3_aa X3_jj X3_uu X3_yy \
0 111 1 1 0 0 1 1 0 1 0 0
1 121 0 0 1 1 1 0 1 0 1 1
Y_III Y_MMM Y_NNN Y_OOO Y_PPP
0 0 1 1 0 0
1 1 0 0 1 1
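Since Y is the target, one possible follow-up is to split the aggregated frame into a feature matrix and a multilabel indicator matrix. This is only a sketch: it rebuilds the sample data from the question, and it assumes that the target dummies can be recognised by the `Y_` prefix that get_dummies generates. The names `encoded`, `label_cols`, `X` and `Y` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 121, 121, 121],
    "X1": ["AA", "AA", "BB", "HH", "HH", "HH"],
    "X2": ["LL", "LL", "LL", "DD", "DD", "LL"],
    "X3": ["KK", "KK", "jj", "uu", "yy", "aa"],
    "Y": ["MMM", "MMM", "NNN", "III", "OOO", "PPP"],
})

# One row per ID; a cell is 1 if the value occurs for that ID at all.
encoded = pd.get_dummies(df).groupby("ID").max().astype("int64")

# Assumption: every column produced from the target starts with "Y_".
label_cols = [c for c in encoded.columns if c.startswith("Y_")]

X = encoded.drop(columns=label_cols)  # feature indicators
Y = encoded[label_cols]               # one indicator column per label

print(X)
print(Y)
```

Each row of `Y` can then hold several 1s, which matches the case where one ID has accessed multiple channels.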
Upvotes: 2