Reputation: 608
How do I efficiently encode the table below so that each row marks which values it contains, with one column per unique value?
e.g.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [2, 3, 4], [1, 3, 4]]), columns=['col_1', 'col_2', 'col_3'])
   col_1  col_2  col_3
0      1      2      3
1      2      3      4
2      1      3      4
to
   1  2  3  4
0  1  1  1  0
1  0  1  1  1
2  1  0  1  1
Upvotes: 0
Views: 42
Reputation: 221574
Here's one way -
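def hotencode(df):
    # unq: sorted unique values; idx: position of each cell's value within unq
    unq, idx = np.unique(df, return_inverse=True)
    col_idx = idx.reshape(df.shape)
    # One output row per input row, one output column per unique value
    out = np.zeros((len(col_idx), col_idx.max() + 1), dtype=int)
    # Set a 1 at every (row, value position) pair
    out[np.arange(len(col_idx))[:, None], col_idx] = 1
    return pd.DataFrame(out, columns=unq, index=df.index)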
Another way, with broadcasting, would be -
unq = np.unique(df)
out = (df.values[...,None] == unq).any(1).astype(int)
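The broadcasting snippet above leaves `out` as a bare array. As a minimal sketch, it could be wrapped so that it returns a labelled DataFrame like `hotencode` does. The function name `hotencode_broadcast` is my own, not part of the original answer, and `to_numpy()` is used in place of `.values` on the assumption of a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

def hotencode_broadcast(df):
    # Hypothetical helper name, introduced here for illustration only.
    values = df.to_numpy()
    # Sorted unique values across the whole frame become the output columns.
    unq = np.unique(values)
    # Compare every cell with every unique value: shape (rows, cols, n_unique).
    mask = values[..., None] == unq
    # A value is present in a row if any column of that row matches it.
    out = mask.any(axis=1).astype(int)
    return pd.DataFrame(out, columns=unq, index=df.index)

df = pd.DataFrame(np.array([[1, 2, 3], [2, 3, 4], [1, 3, 4]]),
                  columns=['col_1', 'col_2', 'col_3'])
result = hotencode_broadcast(df)
print(result)
```

Note that the intermediate boolean array has rows x columns x unique-values entries, so for frames with many distinct values the index-based `hotencode` is likely to use less memory.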
Sample run -
In [81]: df
Out[81]:
   col_1  col_2  col_3
0      1      2      3
1      2      3      4
2      1      3      4

In [82]: hotencode(df)
Out[82]:
   1  2  3  4
0  1  1  1  0
1  0  1  1  1
2  1  0  1  1
Upvotes: 1