Pierre-Antoine
Reputation: 8129

How to re-use LabelBinarizer for input prediction in Scikit-Learn

I trained a classifier using Scikit-Learn. I am loading the training input for my classifier from a CSV. The values of some of my columns (e.g. 'Town') are categorical (e.g. 'New York', 'Paris', 'Stockholm', ...). In order to use those categorical columns, I am doing one-hot encoding with the LabelBinarizer from Scikit-Learn.

This is how I transform data before training:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

headers = [ 
    'Ref.', 'Town' #,...
]

df = pd.read_csv("/path/to/some.csv", header=None, names=headers, na_values="?")

lb = LabelBinarizer()
lb_results = lb.fit_transform(df['Town'])

It is, however, not clear to me how to use the LabelBinarizer to create feature vectors from new input data for which I want to make predictions. In particular, if the new data contains an already seen town (e.g. New York), it needs to be encoded in the same position as that town in the training data.

How is the label binarization supposed to be re-applied to new input data?

(I don't have a strong preference for Scikit-Learn; if someone knows how to do it with Pandas' get_dummies method, that is fine too.)

Upvotes: 4

Views: 1785

Answers (1)

MaxU - stand with Ukraine

Reputation: 210982

Just call lb.transform() on the already fitted lb object instead of lb.fit_transform().

Demo:

Assuming we have the following train DF:

In [250]: df
Out[250]:
           Town
0      New York
1        Munich
2          Kiev
3         Paris
4        Berlin
5      New York
6  Zaporizhzhia
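
The session in this demo does not show how df, new and lb are created. A minimal setup sketch that should reproduce them, assuming the frames are built by hand rather than read from the CSV in the question:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Training frame with the same towns as in the demo.
df = pd.DataFrame({"Town": ["New York", "Munich", "Kiev", "Paris",
                            "Berlin", "New York", "Zaporizhzhia"]})

# New data to encode later; 'Dubai' does not occur in the training frame.
new = pd.DataFrame({"Town": ["Munich", "New York", "Dubai"]})

# Unfitted binarizer; it learns the town list when fit/fit_transform is called.
lb = LabelBinarizer()
```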

Fit (train) & transform (binarize) in one step:

In [251]: r1 = pd.DataFrame(lb.fit_transform(df['Town']), columns=lb.classes_)

Yields:

In [252]: r1
Out[252]:
   Berlin  Kiev  Munich  New York  Paris  Zaporizhzhia
0       0     0       0         1      0             0
1       0     0       1         0      0             0
2       0     1       0         0      0             0
3       0     0       0         0      1             0
4       1     0       0         0      0             0
5       0     0       0         1      0             0
6       0     0       0         0      0             1

lb is now fitted on the towns that were present in df; the learned town list is stored in lb.classes_.

Now we can binarize new data sets with the fitted lb object (using lb.transform()):

In [253]: new
Out[253]:
       Town
0    Munich
1  New York
2     Dubai  # <--- new (not trained) town

In [254]: r2 = pd.DataFrame(lb.transform(new['Town']), columns=lb.classes_)

In [255]: r2
Out[255]:
   Berlin  Kiev  Munich  New York  Paris  Zaporizhzhia
0       0     0       1         0      0             0
1       0     0       0         1      0             0
2       0     0       0         0      0             0
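
Note that the unseen town (Dubai) is encoded as a row of zeros rather than raising an error, so it may be worth flagging such rows explicitly. The sketch below is self-contained and shows one possible way to do that, plus a pandas-only route via get_dummies that reuses the training columns. The variable names (train, unseen, train_columns, dummies) are illustrative assumptions, not part of the original code:

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

train = pd.DataFrame({"Town": ["New York", "Munich", "Kiev", "Paris",
                               "Berlin", "New York", "Zaporizhzhia"]})
new = pd.DataFrame({"Town": ["Munich", "New York", "Dubai"]})

# Fit once on the training data; classes_ fixes the column order.
lb = LabelBinarizer().fit(train["Town"])

# Boolean mask of rows whose town was not present at fit time.
unseen = ~new["Town"].isin(lb.classes_)

# Re-apply the fitted binarizer: same columns, same positions as in training.
encoded = pd.DataFrame(lb.transform(new["Town"]),
                       columns=lb.classes_, index=new.index)

# pandas-only alternative: remember the training columns, then align the
# new dummies to them (extra towns are dropped, missing towns filled with 0).
train_columns = pd.get_dummies(train["Town"]).columns
dummies = (pd.get_dummies(new["Town"], dtype=int)
             .reindex(columns=train_columns, fill_value=0))
```

With either route, the column layout is fixed by the training data. One caveat with LabelBinarizer: if the training column has only two distinct values, it produces a single 0/1 column instead of two.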

Upvotes: 4
