roschach

Reputation: 9336

How to transform the categorical values of a dataframe column into one-hot encoded columns in scikit-learn?

Given the column letters of the dataframe df, with categorical values ['A', 'B', 'C'], I want to obtain one column per category, where each row has exactly one non-zero value, in the column corresponding to its original categorical value.

For example from the dataframe

letters
A
C
A
C

I want to have:

A    B    C
1    0    0
0    0    1
1    0    0
0    0    1

Now in pandas this is very easy:

import pandas as pd

dummies = pd.get_dummies(df.letters)
res = pd.concat([df, dummies], axis=1)
# Drop the original column from the result (res), not from df
res.drop('letters', axis=1, inplace=True)

In scikit-learn, the LabelBinarizer can be used:

from sklearn.preprocessing import LabelBinarizer
cat_col = df["letters"]
encoder = LabelBinarizer()
col_1hot = encoder.fit_transform(cat_col)
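# col_1hot is already a dense NumPy array (sparse_output=False by default),
# so no .toarray() call is needed
col_1hot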

However, this is just a matrix of 0s and 1s, and I've lost the reference to the original categorical values, so I cannot assume that the first one-hot column is A, the second is B, and so on. How do I perform one-hot encoding in scikit-learn and obtain a dataframe?

EDIT

@Joe Halliwell suggested using lb.inverse_transform(onehot), so in this specific case I did:

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
onehot = lb.fit_transform(df)
res = pd.DataFrame(data=onehot, columns=lb.inverse_transform(onehot).reshape(-1,))

which works in this case only because the number of rows equals the number of categories. With more rows, it no longer works.
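
The mismatch can be seen by comparing shapes. This is a minimal sketch, assuming an illustrative four-row column with three distinct categories (the data is made up for the example): inverse_transform(onehot) returns one label per row, while the one-hot matrix has one column per category, so the labels cannot serve as column names.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Illustrative data: 4 rows, but only 3 distinct categories
letters = pd.Series(["A", "C", "B", "A"])

lb = LabelBinarizer()
onehot = lb.fit_transform(letters)

# inverse_transform maps each ROW back to its label
row_labels = lb.inverse_transform(onehot)

print(onehot.shape)     # (rows, categories)
print(len(row_labels))  # one label per row, not per column
```

Passing row_labels as columns= to pd.DataFrame would therefore fail whenever the number of rows differs from the number of categories.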

Upvotes: 2

Views: 225

Answers (1)

Joe Halliwell

Reputation: 1177

You can't use scikit-learn to obtain a DataFrame directly. You can, however, call the inverse_transform method on your fitted LabelBinarizer to retrieve the labels, e.g.:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({"foo": ["A", "C", "B", "A"]})
lb = preprocessing.LabelBinarizer()
onehot = lb.fit_transform(df)
print(lb.inverse_transform(onehot))

So, in order to retrieve the labels for the columns in your one-hot encoded matrix, you can run the identity matrix through the inverse transform:

import numpy as np

print(lb.inverse_transform(np.identity(onehot.shape[1])))
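
The recovered labels can then be used as column names when wrapping the matrix in a DataFrame yourself. This is a minimal sketch, assuming a column named letters with three distinct categories (the name and data are illustrative); the fitted binarizer's classes_ attribute should hold the same labels in the same order.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Illustrative data with three distinct categories
df = pd.DataFrame({"letters": ["A", "C", "B", "A"]})

lb = LabelBinarizer()
onehot = lb.fit_transform(df["letters"])

# One label per one-hot column, recovered via the identity matrix
labels = lb.inverse_transform(np.identity(onehot.shape[1]))

# Wrap the 0/1 matrix in a DataFrame, keeping the original row index
res = pd.DataFrame(onehot, columns=labels, index=df.index)
print(res)
```

Note that with only two distinct categories LabelBinarizer produces a single column rather than two, so this sketch would need adjusting for that case.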

Upvotes: 2
