One-hot encoding in train, validation and test sets (production data)

Question

For example, I have the training set below.

   name     values
0  Tony      100
1  Smith     110
2  Sam       120
3  Shane     130
4  Sam       140
5  Ram       160

After one-hot encoding it becomes:

    values   0    1    2    3    4   
0   100      1    0    0    0    0
1   110      0    1    0    0    0
2   120      0    0    1    0    0
3   130      0    0    0    1    0 
4   140      0    0    1    0    0
5   160      0    0    0    0    1

Now suppose I have test data in production where Danny is a new level in name:

   name     values
0  Shane      200
1  Danny      210
2  Sam        220
3  Tony       180
4  Danny      150

After one-hot encoding this, I get:

    values   0    1    2    3    
0   200      1    0    0    0 
1   210      0    1    0    0
2   220      0    0    1    0
3   180      0    0    0    1
4   150      0    1    0    0

I have a few questions based on the situation above:

How should I deal with a new level (value) of a categorical variable in production test data?
How do I keep the input feature size of the model constant (in the example above it was 6 in training and 5 in the test data)?
Also, Tony was feature 0 in the training set but is feature 3 in the test set; does this affect the trained model's predictions on test input?

Alex Serra Marrugat · Accepted Answer

How to deal with new entry of level in production test data?

OneHotEncoder has a parameter for this issue: handle_unknown

handle_unknown{‘error’, ‘ignore’}, default=’error’ Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

As you can see, the documentation quoted above lists two values for this parameter. If new classes can appear in your test data (like Danny in your example), I recommend using the value ignore:

enc = OneHotEncoder(handle_unknown='ignore')
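
As a minimal sketch of what this looks like on the data from the question (the variable names train_names and test_names are only illustrative, and the encoder is applied to the name column alone), the unseen name Danny ends up as a row of zeros:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The "name" column from the question, as 2D arrays (one column each).
train_names = np.array(
    ["Tony", "Smith", "Sam", "Shane", "Sam", "Ram"], dtype=object
).reshape(-1, 1)
test_names = np.array(
    ["Shane", "Danny", "Sam", "Tony", "Danny"], dtype=object
).reshape(-1, 1)

# Fit on the training data only; unknown categories are ignored at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_names)

# transform returns a sparse matrix by default; convert it for inspection.
test_encoded = enc.transform(test_names).toarray()
print(test_encoded)
```

Rows 1 and 4 of test_encoded (the two Danny rows) contain only zeros, while the other rows have a single 1 in the column of the matching training category.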

How to maintain input feature size for model? Order of classes?

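The encoder always keeps the feature layout of the data it was fitted on. For example, with the data you provided, if you fit the OneHotEncoder on the training data, the name column always expands to the same 5 columns, so together with values the model always receives 6 inputs, as in your training table.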

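These columns also always correspond to the categories of the training data, in a fixed order. In the numbering of your example, feature 0 will always refer to Tony, feature 1 to Smith, and so on. (scikit-learn itself sorts the categories, so the actual column order would be Ram, Sam, Shane, Smith, Tony; either way, the order is fixed at fit time.)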
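
One way to check this is to inspect the fitted encoder's categories_ attribute and compare the widths of the encoded train and test matrices. The sketch below assumes the encoder is applied to the name column only and that values is stacked next to it with NumPy; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_names = np.array(
    ["Tony", "Smith", "Sam", "Shane", "Sam", "Ram"], dtype=object
).reshape(-1, 1)
train_values = np.array([100, 110, 120, 130, 140, 160]).reshape(-1, 1)

test_names = np.array(
    ["Shane", "Danny", "Sam", "Tony", "Danny"], dtype=object
).reshape(-1, 1)
test_values = np.array([200, 210, 220, 180, 150]).reshape(-1, 1)

enc = OneHotEncoder(handle_unknown="ignore").fit(train_names)

# Categories learned at fit time; their order defines the output columns.
categories = list(enc.categories_[0])
print(categories)

# values + one-hot columns: the same width for train and test.
X_train = np.hstack([train_values, enc.transform(train_names).toarray()])
X_test = np.hstack([test_values, enc.transform(test_names).toarray()])
print(X_train.shape, X_test.shape)
```

Both matrices have 6 columns (values plus 5 one-hot columns), even though the test data contains only 4 distinct names, one of which is new.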

If you want to reuse the fitted OneHotEncoder in another script, you can save it with the joblib library. For example:

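import joblib
from sklearn.preprocessing import OneHotEncoder

# data: the training column(s) to encode, as a 2D array or DataFrame
enc = OneHotEncoder(handle_unknown='error')
enc.fit(data)
joblib.dump(enc, 'encoder.joblib')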

And then load it from the other script:

import joblib

enc = joblib.load('encoder.joblib')
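
The save and load steps can be tried end to end in a single script. This is only a sketch: it writes to a temporary directory instead of a fixed path, and the variable names are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_names = np.array(
    ["Tony", "Smith", "Sam", "Shane", "Sam", "Ram"], dtype=object
).reshape(-1, 1)
enc = OneHotEncoder(handle_unknown="ignore").fit(train_names)

# Save the fitted encoder and load it back, as a second script would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "encoder.joblib")
    joblib.dump(enc, path)
    loaded = joblib.load(path)

# The loaded encoder carries the same categories and encodes identically.
same_categories = list(loaded.categories_[0]) == list(enc.categories_[0])
new_rows = np.array(["Shane", "Danny"], dtype=object).reshape(-1, 1)
same_output = np.array_equal(
    loaded.transform(new_rows).toarray(), enc.transform(new_rows).toarray()
)
print(same_categories, same_output)
```

The loaded object is already fitted, so the second script should only call transform on it, never fit.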

Clarifications

Finally, I would like to clarify how one-hot encoding works, because I think it is not entirely clear:

For one-hot encoding, you first need to fit the encoder on one dataset (almost always the training data). What does this step do? It tells the encoder how many classes there are, which ones, and in what order (in your case: 5 classes, numbered in your example as Tony, Smith, ..)
Then you can transform any data with this fitted OneHotEncoder by calling transform. For example, the results for your test data would be:

Shane, even though it is the first class that appears in your test data, will remain feature 3 (so a 1 appears in feature 3 and zeros in the other features), since that position was fixed in the fit step on the training data.

Danny won't have a 1 in any feature, since this name didn't appear in the training data. As mentioned for the first question, if you set handle_unknown to error, transform raises an error; if you set it to ignore, the row is encoded with all features set to 0.
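
The strict setting can be sketched as follows. Variable names are illustrative; note also that scikit-learn sorts the categories, so Shane's column index in this sketch differs from the hand-numbered table in the question, but it is still determined only by the training data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_names = np.array(
    ["Tony", "Smith", "Sam", "Shane", "Sam", "Ram"], dtype=object
).reshape(-1, 1)
test_names = np.array(
    ["Shane", "Danny", "Sam", "Tony", "Danny"], dtype=object
).reshape(-1, 1)

enc_strict = OneHotEncoder(handle_unknown="error").fit(train_names)

# Shane's column is decided at fit time, not by its position in the test data.
shane_index = list(enc_strict.categories_[0]).index("Shane")
shane_row = enc_strict.transform(
    np.array([["Shane"]], dtype=object)
).toarray()[0]
shane_column = int(np.argmax(shane_row))

# With handle_unknown="error", the unseen name Danny makes transform fail.
try:
    enc_strict.transform(test_names)
    raised = False
except ValueError as exc:
    raised = True
    print("transform failed:", exc)
```

Whether failing loudly or encoding all zeros is preferable depends on the application: an error makes unseen categories visible immediately, while ignore keeps a production pipeline running.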

So, as you can see, you first fit on one dataset and then apply what was learned to transform other data. You only have to fit the OneHotEncoder once.

Note: you can fit and transform the training data in one step with fit_transform.
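
A quick check, using illustrative variable names, that the one-step call gives the same result on the training data as fitting and then transforming:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_names = np.array(
    ["Tony", "Smith", "Sam", "Shane", "Sam", "Ram"], dtype=object
).reshape(-1, 1)

# One step: learn the categories and encode the training data.
one_step = OneHotEncoder(handle_unknown="ignore").fit_transform(train_names).toarray()

# Two steps: fit first, then transform the same data.
enc = OneHotEncoder(handle_unknown="ignore").fit(train_names)
two_steps = enc.transform(train_names).toarray()

identical = np.array_equal(one_step, two_steps)
print(identical)
```

For test or production data, only transform should be called; calling fit_transform there would relearn the categories and change the column layout.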
