I have a pandas DataFrame, train, that I'm one-hot encoding. It looks something like this:
car
0 Mazda
1 BMW
2 Honda
If I use pd.get_dummies, I'll get this:
car_BMW car_Honda car_Mazda
0 0 0 1
1 1 0 0
2 0 1 0
All good so far.
However, I don't have access to my test set, so I need to handle the possibility that a value for car appears in test that wasn't seen in train.
Suppose test is this:
car
0 Mazda
1 Audi
Then if I use pd.get_dummies on test, I get:
car_Audi car_Mazda
0 0 1
1 1 0
This is wrong, because I have a new column, car_Audi, and am missing car_BMW and car_Honda.
I'd like the output of one-hot encoding test to be:
car_BMW car_Honda car_Mazda
0 0 0 1
1 0 0 0
So it just ignores previously unseen values in test. I definitely don't want to create new columns for previously unseen values in test.
I've looked into sklearn.preprocessing.LabelBinarizer, but it outputs a NumPy array and the order of the columns isn't clear:
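from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
# fit on the single column of labels from the train DataFrame
train_transformed = lb.fit_transform(train['car'])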
gives me back:
array([[0, 0, 1],
[1, 0, 0],
[0, 1, 0]])
Any ideas here?
Thanks!
Upvotes: 3
Views: 860
Reputation: 403218
This isn't a hard problem to solve. LabelBinarizer has a fitted attribute, classes_, that you can query to find the column position of each original label:
train_transformed = lb.fit_transform(df)
print(train_transformed)
array([[0, 0, 1],
[1, 0, 0],
[0, 1, 0]])
print(lb.classes_)
array(['BMW', 'Honda', 'Mazda'], dtype='<U5')
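Building on classes_, here is a minimal sketch of how the fitted binarizer could be reused on test so that it yields the columns asked for in the question. It assumes the train and test frames from the question; the helper name encode and the car_ column prefix are illustrative choices, not part of scikit-learn. It relies on LabelBinarizer.transform emitting an all-zero row for a label it did not see during fit.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Sample data mirroring the question (assumed for illustration).
train = pd.DataFrame({"car": ["Mazda", "BMW", "Honda"]})
test = pd.DataFrame({"car": ["Mazda", "Audi"]})

# Fit on train only, so the set of output columns is fixed by train.
lb = LabelBinarizer()
lb.fit(train["car"])


def encode(df, column="car"):
    """Hypothetical helper: one-hot encode `column` using the fitted binarizer."""
    encoded = lb.transform(df[column])
    # lb.classes_ gives the label behind each output column, in order.
    names = [f"{column}_{label}" for label in lb.classes_]
    return pd.DataFrame(encoded, columns=names, index=df.index)


train_encoded = encode(train)
test_encoded = encode(test)  # the unseen 'Audi' row is all zeros
print(test_encoded)
```

One caveat: when train contains exactly two distinct labels, LabelBinarizer produces a single output column rather than two, so the column naming above would need adjusting in that case.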
Upvotes: 1