Hardian Lawi

Reputation: 608

Efficiently one-hot encode multiple columns that share the same categorical values

How do I encode the table below efficiently?

e.g.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [2, 3, 4], [1, 3, 4]]), columns=['col_1', 'col_2', 'col_3'])

   col_1  col_2  col_3
0      1      2      3
1      2      3      4
2      1      3      4

to

       1      2      3      4
0      1      1      1      0
1      0      1      1      1
2      1      0      1      1

Upvotes: 0

Views: 42

Answers (1)

Divakar

Reputation: 221574

Here's one way, using the inverse indices returned by np.unique -

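def hotencode(df):
    # Unique values, plus each cell's position within unq
    unq, idx = np.unique(df.to_numpy(), return_inverse=True)
    col_idx = idx.reshape(df.shape)
    # One row per input row, one column per unique value
    out = np.zeros((len(col_idx), len(unq)), dtype=int)
    # Set out[i, col_idx[i, j]] = 1 for every cell (i, j)
    out[np.arange(len(col_idx))[:, None], col_idx] = 1
    return pd.DataFrame(out, columns=unq, index=df.index)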

Another way, using broadcasting, which yields a plain NumPy array rather than a DataFrame, would be -

unq = np.unique(df)
out = (df.values[...,None] == unq).any(1).astype(int)
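For reference, the broadcasting approach can be wrapped into a self-contained function that returns a labelled DataFrame. This is only a sketch: the name hotencode_broadcast is made up for illustration and is not part of the original answer. The comparison builds an intermediate boolean array of shape (n_rows, n_cols, n_unique), so memory use grows with the number of distinct values.

```python
import numpy as np
import pandas as pd

def hotencode_broadcast(df):
    # Hypothetical helper name, not from the original answer.
    values = df.to_numpy()
    unq = np.unique(values)
    # (n_rows, n_cols, 1) == (n_unique,) broadcasts to (n_rows, n_cols, n_unique)
    mask = values[..., None] == unq
    # A value is "hot" for a row if any column in that row holds it
    out = mask.any(axis=1).astype(int)
    return pd.DataFrame(out, columns=unq, index=df.index)

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [1, 3, 4]],
                  columns=['col_1', 'col_2', 'col_3'])
encoded = hotencode_broadcast(df)
print(encoded)
```

On wide tables with many distinct values, the index-based version is likely the safer choice because it avoids the three-dimensional intermediate.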

Sample run -

In [81]: df
Out[81]: 
   col_1  col_2  col_3
0      1      2      3
1      2      3      4
2      1      3      4

In [82]: hotencode(df)
Out[82]: 
   1  2  3  4
0  1  1  1  0
1  0  1  1  1
2  1  0  1  1
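As a cross-check of the output above, the same table can be produced with pandas alone. This is a different technique from the two NumPy approaches (stack the columns, call pd.get_dummies, then aggregate per row), and it is probably slower on large data, so treat it as a readability-first sketch rather than the efficient solution.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 3, 4], [1, 3, 4]],
                  columns=['col_1', 'col_2', 'col_3'])

# Stack all columns into one Series indexed by (row, column)
stacked = df.stack()
# One indicator column per distinct value, one row per original cell
dummies = pd.get_dummies(stacked)
# Collapse back to one row per original row: 1 if any cell held the value
encoded_pd = dummies.groupby(level=0).max().astype(int)
print(encoded_pd)
```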

Upvotes: 1
