newleaf

Reputation: 2457

pandas dataframe label columns encoding

I have a pandas DataFrame whose label columns contain strings. df looks like:

news                          label1      label2      label3  label4
COVID Hospitalizations ....   health
will pets contract covid....  health      pets
High temperature will cause.. health      weather
...

Expected output

news                          health      pets      weather  tech
COVID Hospitalizations ....   1           0         0        0 
will pets contract covid....  1           1         0        0
High temperature will cause.. 1           0         1        0
... 

Currently I use sklearn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# collect the four label columns into one list per row
df['labels'] = df[['label1','label2','label3','label4']].values.tolist()
mlb.fit(df['labels'])
temp = mlb.transform(df['labels'])
# one indicator column per class found by the binarizer
ff = pd.DataFrame(temp, columns = list(mlb.classes_))
df_final = pd.concat([df['news'],ff], axis=1)

This works so far. I am just wondering whether there is a way to do this without sklearn.preprocessing.MultiLabelBinarizer?

Upvotes: 2

Views: 189

Answers (1)

jezrael

Reputation: 862406

One idea is to join the values of each row with | and then use Series.str.get_dummies:

#if there are missing values (NaN), replace them with empty strings first
#df = df.fillna('')
df_final = df.set_index('news').agg('|'.join, 1).str.get_dummies().reset_index()
print (df_final)
                            news  health  pets  weather
0    COVID Hospitalizations ....       1     0        0
1   will pets contract covid....       1     1        0
2  High temperature will cause..       1     0        1
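
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same join-then-get_dummies idea. The sample frame is my own reconstruction of the question's data (missing labels stored as None, news texts shortened), and instead of fillna('') it drops the missing labels per row before joining, so treat it as one possible variant rather than the only way to do it:

```python
import pandas as pd

# Assumed sample data: mirrors the question, missing labels are None
df = pd.DataFrame({
    "news": ["COVID Hospitalizations", "will pets contract covid",
             "High temperature will cause"],
    "label1": ["health", "health", "health"],
    "label2": [None, "pets", "weather"],
    "label3": [None, None, None],
    "label4": [None, None, None],
})

label_cols = ["label1", "label2", "label3", "label4"]

# Join only the non-missing labels of each row with '|'
joined = df[label_cols].apply(
    lambda row: "|".join(row.dropna().astype(str)), axis=1
)

# Split each joined string on '|' and build one indicator column per label
dummies = joined.str.get_dummies(sep="|")

df_final = pd.concat([df["news"], dummies], axis=1)
print(df_final)
```

Note that a label such as tech, which never occurs in the data, gets no column here, the same as in the output above.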

Or use get_dummies:

df_final = (pd.get_dummies(df.set_index('news'), prefix='', prefix_sep='')
              .groupby(level=0,axis=1)
              .max()
              .reset_index())

#the second column's name is an empty string, which is the difference from the solution above
print (df_final)
                            news     health  pets  weather
0    COVID Hospitalizations ....  1       1     0        0
1   will pets contract covid....  1       1     1        0
2  High temperature will cause..  1       1     0        1
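
As far as I know, recent pandas versions deprecate the axis=1 argument of groupby, so the column-wise grouping above may warn or fail depending on your version. A possible workaround is to transpose, group by the index, and transpose back. The sketch below uses a small made-up frame (two label columns, with health appearing in both) to show why the grouping step is needed, and passes dtype=int so the indicators are 0/1:

```python
import pandas as pd

# Assumed sample data: 'health' occurs in both label columns
df = pd.DataFrame({
    "news": ["a", "b", "c"],
    "label1": ["health", "pets", "health"],
    "label2": [None, "health", "weather"],
})

# One indicator column per (label column, value) pair; with empty prefix
# and separator the column names are just the label values, so 'health'
# appears twice here
dummies = pd.get_dummies(
    df.set_index("news"), prefix="", prefix_sep="", dtype=int
)

# Collapse the duplicate column names: transpose, group rows by name,
# take the max, and transpose back
df_final = dummies.T.groupby(level=0).max().T.reset_index()
print(df_final)
```

Since None values are not encoded by get_dummies by default, this variant does not produce the empty-string column.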

Upvotes: 2
