Reputation: 2457
I have a pandas DataFrame with string columns. df looks like:
news                            label1  label2   label3  label4
COVID Hospitalizations ....     health
will pets contract covid....    health  pets
High temperature will cause..   health  weather
...
Expected output:
news                            health  pets  weather  tech
COVID Hospitalizations ....     1       0     0        0
will pets contract covid....    1       1     0        0
High temperature will cause..   1       0     1        0
...
Currently I use sklearn:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# collect the four label columns into one list per row
df['labels'] = df[['label1','label2','label3','label4']].values.tolist()
# learn the set of labels, then encode each row as a 0/1 vector
mlb.fit(df['labels'])
temp = mlb.transform(df['labels'])
ff = pd.DataFrame(temp, columns = list(mlb.classes_))
df_final = pd.concat([df['news'],ff], axis=1)
This works so far.
I'm just wondering whether there is a way to avoid using sklearn.preprocessing.MultiLabelBinarizer?
Upvotes: 2
Views: 189
Reputation: 862406
One idea is to join the values of each row with | and then use Series.str.get_dummies:
# if there are missing values (NaN), replace them with empty strings first
# df = df.fillna('')
df_final = df.set_index('news').agg('|'.join, axis=1).str.get_dummies().reset_index()
print(df_final)
                            news  health  pets  weather
0    COVID Hospitalizations ....       1     0        0
1   will pets contract covid....       1     1        0
2  High temperature will cause..       1     0        1
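To try the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the rows shown in the question (news text left truncated as posted), and it assumes the empty label cells are missing values, which is why fillna('') is applied before joining.

```python
import pandas as pd

# Sample data modelled on the question; empty label cells are assumed to be missing (None).
df = pd.DataFrame({
    "news": ["COVID Hospitalizations ....",
             "will pets contract covid....",
             "High temperature will cause.."],
    "label1": ["health", "health", "health"],
    "label2": [None, "pets", "weather"],
    "label3": [None, None, None],
    "label4": [None, None, None],
})

# Replace missing labels with empty strings so that str.join only sees strings.
labels = df.set_index("news").fillna("")

# One string per row, e.g. "health|pets||"; get_dummies splits on "|" and skips empty pieces.
joined = labels.agg("|".join, axis=1)
df_final = joined.str.get_dummies(sep="|").reset_index()

print(df_final)
```

Only labels that actually occur become columns, so a label such as tech that never appears in the data would have to be added separately, for example with reindex(columns=..., fill_value=0).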
Or use get_dummies and combine the columns that share a name:
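# dtype=int keeps 0/1 output (get_dummies returns booleans by default since pandas 2.0)
# groupby(..., axis=1) is deprecated in recent pandas, so transpose and group on the index instead
df_final = (pd.get_dummies(df.set_index('news'), prefix='', prefix_sep='', dtype=int)
              .T
              .groupby(level=0)
              .max()
              .T
              .reset_index())
# the second column name is an empty string, which is the difference from the solution above
print(df_final)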
                            news     health  pets  weather
0    COVID Hospitalizations ....  1       1     0        0
1   will pets contract covid....  1       1     1        0
2  High temperature will cause..  1       1     0        1
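A self-contained sketch of the get_dummies variant follows. It assumes the missing labels are None/NaN rather than empty strings, in which case get_dummies skips them and no empty-named column appears. It also assumes a recent pandas, so it transposes and groups on the index instead of calling groupby with axis=1. The sample frame is rebuilt from the rows shown in the question.

```python
import pandas as pd

# Sample data modelled on the question; empty label cells are assumed to be missing (None).
df = pd.DataFrame({
    "news": ["COVID Hospitalizations ....",
             "will pets contract covid....",
             "High temperature will cause.."],
    "label1": ["health", "health", "health"],
    "label2": [None, "pets", "weather"],
    "label3": [None, None, None],
    "label4": [None, None, None],
})

# One indicator column per (label column, value) pair; empty prefixes leave only the value as the name.
dummies = pd.get_dummies(df.set_index("news"), prefix="", prefix_sep="", dtype=int)

# Collapse columns that share a name: transpose, group the (now row) labels, take the max, transpose back.
df_final = dummies.T.groupby(level=0).max().T.reset_index()

print(df_final)
```

Taking the max per group means a label counts as present if it appears in any of the label columns for that row.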
Upvotes: 2