pandas get_dummies cannot handle unseen labels in test data

Question

I have a Pandas DataFrame, train, that I'm one-hot encoding. It looks something like this:

    car
0   Mazda
1   BMW
2   Honda

If I use pd.get_dummies, I'll get this:

   car_BMW  car_Honda  car_Mazda
0        0          0          1
1        1          0          0
2        0          1          0

All good so far.

However, I don't have access to my test set so I need to handle the possibility that a value for car appears in test that wasn't seen in train.

Suppose test is this:

    car
0   Mazda
1   Audi

Then if I use pd.get_dummies on test, I get:

   car_Audi  car_Mazda
0         0          1
1         1          0

This is wrong: I get a new column, car_Audi, and I'm missing car_BMW and car_Honda.

I'd like the output of one-hot encoding test to be:

   car_BMW  car_Honda  car_Mazda
0        0          0          1
1        0          0          0

So it just ignores previously unseen values in test. I definitely don't want to create new columns for previously unseen values in test.

I've looked into sklearn.preprocessing.LabelBinarizer, but it outputs a NumPy array and the column order isn't clear:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
train_transformed = lb.fit_transform(train["car"])

gives me back:

array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]])

Any ideas here?

Thanks!

cs95 · Accepted Answer

This isn't a hard problem to solve. LabelBinarizer has a fitted attribute, classes_, that you can query to find the column position of each original label:

train_transformed = lb.fit_transform(train["car"])

print(train_transformed)
array([[0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]])

print(lb.classes_)
array(['BMW', 'Honda', 'Mazda'], dtype='
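
Since classes_ gives the column order, one way to get back a labelled DataFrame with the training columns is sketched below. This is a minimal sketch, not a definitive implementation: the encode helper and the "car_" prefix are illustrative names chosen to mirror the get_dummies output in the question, and it relies on LabelBinarizer.transform encoding a label it did not see during fit as an all-zero row when there are three or more training classes, as there are here.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Data from the question.
train = pd.DataFrame({"car": ["Mazda", "BMW", "Honda"]})
test = pd.DataFrame({"car": ["Mazda", "Audi"]})

# Fit on the training labels only.
lb = LabelBinarizer()
lb.fit(train["car"].tolist())

# Column names follow the sorted order stored in lb.classes_.
# The "car_" prefix is an assumption made to match get_dummies.
columns = ["car_" + label for label in lb.classes_]


def encode(df):
    """Hypothetical helper: one-hot encode df["car"] using the training columns."""
    encoded = lb.transform(df["car"].tolist())
    return pd.DataFrame(encoded, columns=columns, index=df.index)


train_encoded = encode(train)
test_encoded = encode(test)  # the "Audi" row comes out as all zeros

print(test_encoded)
```

Because the encoder is fitted once on train and only reused on test, the test frame always has the training columns in the training order, and no car_Audi column is created.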
