Reputation: 558
As an example of the problem, suppose we have this dataframe:
  Name Class
0  Aci    FB
1  Dan   TWT
2  Ann   GRS
3  Aci   GRS
4  Dan    FB
The resulting dataframe df would be:
  Name  FB  TWT  GRS
0  Aci   1    0    1
1  Dan   1    1    0
2  Ann   0    0    1
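For reference, the input can be rebuilt with the snippet below; pd.crosstab is one compact way to get exactly this kind of table (a minimal sketch, not one of the answers below):
import pandas as pd

df = pd.DataFrame({'Name': ['Aci', 'Dan', 'Ann', 'Aci', 'Dan'],
                   'Class': ['FB', 'TWT', 'GRS', 'GRS', 'FB']})

# Cross-tabulate names against classes; each cell counts co-occurrences,
# which for this data is exactly a 0/1 multi-hot table.
print(pd.crosstab(df['Name'], df['Class']))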
Upvotes: 2
Views: 666
Reputation: 765
(pd.get_dummies(df, columns=['Class'])
   .groupby('Name').sum()
   .rename(columns={'Class_FB': 'FB', 'Class_GRS': 'GRS', 'Class_TWT': 'TWT'}))
      FB  GRS  TWT
Name
Aci    1    1    0
Ann    0    1    0
Dan    1    0    1
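The rename step can also be avoided by blanking the dummy prefix; a sketch assuming the same df:
# An empty prefix/prefix_sep makes get_dummies emit the bare class names,
# so no rename is needed afterwards.
pd.get_dummies(df, columns=['Class'], prefix='', prefix_sep='').groupby('Name').sum()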
Upvotes: 0
Reputation: 591
The other solutions were failing for me with memory overflows when I had 315 samples and 1908 labels. Here are a couple of more performant methods.
Using pandas pivot:
import pandas as pd

def multihotencode(data, samples_col, labels_col):
    data = data.copy()
    # Mark every (sample, label) pair that occurs in the data
    data['present'] = 1
    # Pivot to a samples x labels grid; absent pairs become NaN
    multihot = data.pivot(index=samples_col, columns=labels_col, values='present')
    return multihot.fillna(0).astype(int)
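Note that DataFrame.pivot raises a ValueError if the same (sample, label) pair occurs twice. A variant using pivot_table tolerates duplicates (multihotencode_dedup is a hypothetical name, not from the answer above):
def multihotencode_dedup(data, samples_col, labels_col):
    data = data.copy()
    data['present'] = 1
    # pivot_table aggregates duplicate (sample, label) pairs instead of raising;
    # aggfunc='max' keeps the output strictly 0/1.
    multihot = data.pivot_table(index=samples_col, columns=labels_col,
                                values='present', aggfunc='max')
    return multihot.fillna(0).astype(int)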
Using sklearn:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def multihotencode(data, samples_col, labels_col):
    ## Aggregate the labels into a tuple for each sample
    tupled = data.groupby(samples_col).apply(lambda x: tuple(x[labels_col].to_numpy()))
    ## Get the multi-hot encoded matrix
    mlb = MultiLabelBinarizer()
    multihot = mlb.fit_transform(tupled)
    ## Reapply the index and column headers
    return pd.DataFrame(multihot, index=tupled.index, columns=mlb.classes_)
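If memory really is the constraint, MultiLabelBinarizer can also emit a scipy sparse matrix; a sketch, assuming the tupled series from the function above:
# sparse_output=True returns a scipy.sparse matrix instead of a dense array,
# which helps when the label vocabulary is large.
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_multihot = mlb.fit_transform(tupled)

# Wrap it in a DataFrame backed by sparse columns.
multihot_df = pd.DataFrame.sparse.from_spmatrix(
    sparse_multihot, index=tupled.index, columns=mlb.classes_)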
Both methods give the same result:
df = pd.DataFrame([
    ['Aci', 'FB'],
    ['Dan', 'TWT'],
    ['Ann', 'GRS'],
    ['Aci', 'GRS'],
    ['Dan', 'FB'],
], columns=['Name', 'Class'])
multihotencode(df, 'Name', 'Class')
>>>
      FB  GRS  TWT
Name
Aci    1    1    0
Ann    0    1    0
Dan    1    0    1
Upvotes: 1
Reputation: 863781
Use get_dummies with DataFrame.set_index and aggregate with max or sum (with this data max and sum coincide, because no Name/Class pair repeats; sum would count duplicates):
# max keeps the output strictly 0/1
df1 = pd.get_dummies(df.set_index('Name')['Class']).groupby(level=0, sort=False).max().reset_index()

# sum counts occurrences instead
# df1 = pd.get_dummies(df.set_index('Name')['Class']).groupby(level=0, sort=False).sum().reset_index()

print(df1)
  Name  FB  GRS  TWT
0  Aci   1    1    0
1  Dan   1    0    1
2  Ann   0    1    0
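To see why aggregating on level 0 works, here is the intermediate frame: set_index('Name') leaves a repeated Name index, get_dummies one-hot encodes each row, and the aggregation on level=0 simply collapses the repeats:
print(pd.get_dummies(df.set_index('Name')['Class']))
#       FB  GRS  TWT
# Name
# Aci    1    0    0
# Dan    0    0    1
# Ann    0    1    0
# Aci    0    1    0
# Dan    1    0    0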
Upvotes: 2