Codevan

Reputation: 558

Convert 1 column data into multi hot encoding

As an example of the problem, suppose we have this dataframe:

    Name Class
0   Aci  FB 
1   Dan  TWT
2   Ann  GRS
3   Aci  GRS
4   Dan  FB 

The resulting dataframe should be:

   Name  FB  TWT  GRS
0  Aci   1   0    1
1  Dan   1   1    0
2  Ann   0   0    1

Upvotes: 2

Views: 666

Answers (3)

G.G

Reputation: 765

pd.get_dummies(df1,columns=['Class']).groupby('Name').sum().rename(columns={'Class_FB':'FB','Class_GRS':'GRS','Class_TWT':'TWT'})

   FB  GRS  TWT
Name              
Aci    1    1    0
Ann    0    1    0
Dan    1    0    1

Upvotes: 0

Tristan Brown

Reputation: 591

The other solutions ran out of memory for me on a dataset with 315 samples and 1908 labels. Here are some more performant methods.

Using pandas pivot:

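def multihotencode(data, samples_col, labels_col):
    # Work on a copy so the caller's dataframe is not modified
    data = data.copy()
    # Marker column: every (sample, label) row that exists counts as 1
    data['present'] = 1
    # One row per sample, one column per label; missing pairs become NaN
    multihot = data.pivot(index=samples_col, columns=labels_col, values='present')
    return multihot.fillna(0).astype(int)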

Using sklearn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def multihotencode(data, samples_col, labels_col):

    ## Aggregate the labels into a tuple for each sample
    tupled = data.groupby(samples_col)[labels_col].apply(tuple)

    ## Get multi-hot encoded matrix.
    mlb = MultiLabelBinarizer()
    multihot = mlb.fit_transform(tupled)

    ## Reapply the index and column headers
    return pd.DataFrame(multihot, index=tupled.index, columns=mlb.classes_)

Both methods give the same result:

df = pd.DataFrame([
    ['Aci', 'FB'],
    ['Dan', 'TWT'],
    ['Ann', 'GRS'],
    ['Aci', 'GRS'],
    ['Dan', 'FB']
], columns=['Name', 'Class'])

multihotencode(df, 'Name', 'Class')
>>> 
    FB  GRS TWT
Name            
Aci 1   1   0
Ann 0   1   0
Dan 1   0   1
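
One caveat on the pivot variant: `DataFrame.pivot` raises a ValueError when the same (sample, label) pair appears more than once. Below is a minimal sketch of one way around that, assuming repeated pairs should still produce a single 1. The name `multihotencode_dedup` and the sample data are hypothetical, not part of the answer above.

```python
import pandas as pd

def multihotencode_dedup(data, samples_col, labels_col):
    # Keep one row per (sample, label) pair so pivot sees unique entries
    pairs = data[[samples_col, labels_col]].drop_duplicates().copy()
    pairs['present'] = 1
    multihot = pairs.pivot(index=samples_col, columns=labels_col, values='present')
    return multihot.fillna(0).astype(int)

# Hypothetical data: 'Aci' is listed under 'FB' twice
df_dup = pd.DataFrame({'Name': ['Aci', 'Aci', 'Dan'],
                       'Class': ['FB', 'FB', 'TWT']})
result = multihotencode_dedup(df_dup, 'Name', 'Class')
print(result)
```

The sklearn variant should not need this step, since `MultiLabelBinarizer` treats each sample's labels as a set.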

Upvotes: 1

jezrael

Reputation: 863781

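Use get_dummies with DataFrame.set_index, then aggregate per Name with max or sum:

#always 0,1 in output (dtype=int avoids the boolean columns newer pandas returns by default)
df1 = pd.get_dummies(df.set_index('Name')['Class'], dtype=int).groupby(level=0, sort=False).max().reset_index()
#if need count values
#df1 = pd.get_dummies(df.set_index('Name')['Class'], dtype=int).groupby(level=0, sort=False).sum().reset_index()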

print (df1)
  Name  FB  GRS  TWT
0  Aci   1    1    0
1  Dan   1    0    1
2  Ann   0    1    0
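
The difference between max and sum only shows up when a Name has the same Class more than once. Here is a small sketch with hypothetical data (`df_dup` is not from the question) that illustrates it. It aggregates with `groupby(level=0)`, which should work on both older and current pandas releases.

```python
import pandas as pd

# Hypothetical data: 'Aci' is listed under 'FB' twice
df_dup = pd.DataFrame({'Name': ['Aci', 'Aci', 'Dan'],
                       'Class': ['FB', 'FB', 'TWT']})

dummies = pd.get_dummies(df_dup.set_index('Name')['Class'], dtype=int)

flags = dummies.groupby(level=0).max()   # presence only: values stay 0 or 1
counts = dummies.groupby(level=0).sum()  # number of occurrences per Name

print(flags.loc['Aci', 'FB'])   # 1
print(counts.loc['Aci', 'FB'])  # 2
```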

Upvotes: 2
