Reputation: 9336
Given the column letters of the dataframe df, with categorical values ['A', 'B', 'C'], I want to obtain several columns in the dataframe, where each row has just one non-zero value corresponding to the original categorical value.
For example, from the dataframe
letters
A
C
A
C
I want to have:
A B C
1 0 0
0 0 1
1 0 0
0 0 1
Now in pandas this is very easy:
dummies = pd.get_dummies(df.letters)
res = pd.concat([df, dummies], axis=1)
res.drop('letters', axis=1, inplace=True)
In scikit-learn, the LabelBinarizer can be used:
from sklearn.preprocessing import LabelBinarizer
cat_col = df["letters"]
encoder = LabelBinarizer()
col_1hot = encoder.fit_transform(cat_col)
col_1hot  # already a dense NumPy array; .toarray() is only needed with sparse_output=True
However, this is just a 0/1 matrix and I've lost the references to the original categorical values. Thus, I cannot assume that the first one-hot column is A, the second is B, and so on.
So how do I perform one-hot encoding in scikit-learn and obtain a dataframe?
EDIT
@Joe Halliwell suggested something like lb.inverse_transform(onehot), so in this specific case I did:
lb = preprocessing.LabelBinarizer()
onehot = lb.fit_transform(df)
res = pd.DataFrame(data=onehot, columns=lb.inverse_transform(onehot).reshape(-1,))
which works in this case only because the number of rows is equal to the number of categories. If I have more rows than categories, this no longer works.
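To illustrate the failure, here is a small sketch with made-up data (five rows, three categories; the name row_labels is only for illustration). inverse_transform returns one label per row, so its length matches the rows of the matrix rather than its columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Five rows but only three distinct categories (made-up example data)
df = pd.DataFrame({"letters": ["A", "C", "A", "C", "B"]})

lb = LabelBinarizer()
onehot = lb.fit_transform(df["letters"])

# One label per ROW of onehot, not one per column
row_labels = lb.inverse_transform(onehot)
print(onehot.shape)     # (number of rows, number of categories)
print(len(row_labels))  # equals the number of rows

try:
    # Five column names for a three-column matrix cannot work
    pd.DataFrame(data=onehot, columns=row_labels)
except ValueError as exc:
    print("column labels do not match the matrix:", exc)
```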
Upvotes: 2
Views: 225
Reputation: 1177
scikit-learn's LabelBinarizer gives you a NumPy array, not a DataFrame. You can, however, call the inverse_transform method on your fitted LabelBinarizer to retrieve the labels, e.g.:
import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame({"foo": ["A", "C", "B", "A"]})
lb = preprocessing.LabelBinarizer()
onehot = lb.fit_transform(df)
print(lb.inverse_transform(onehot))
So, in order to retrieve the labels for the columns in your one-hot encoded matrix, you can run the identity matrix through the inverse transform:
import numpy as np
print(lb.inverse_transform(np.identity(onehot.shape[1])))
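A fitted LabelBinarizer also keeps its labels in the classes_ attribute, in the same order as the columns of the output matrix, so the column names can be read from there directly. A minimal sketch, assuming the column is called letters as in the question (the names encoded and res are only illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.DataFrame({"letters": ["A", "C", "B", "A"]})

lb = LabelBinarizer()
onehot = lb.fit_transform(df["letters"])  # dense 0/1 array, one column per class

# classes_ holds the sorted labels, aligned with the columns of onehot
encoded = pd.DataFrame(onehot, columns=lb.classes_, index=df.index)

# Optionally attach the indicator columns and drop the original column
res = pd.concat([df.drop(columns="letters"), encoded], axis=1)
print(res)
```

Note that with exactly two distinct categories LabelBinarizer returns a single column, so columns=lb.classes_ would not line up; in that case pd.get_dummies or OneHotEncoder is probably the safer choice.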
Upvotes: 2