Reputation: 428
For example, I have the following training set:
    name  values
0   Tony     100
1  Smith     110
2    Sam     120
3  Shane     130
4    Sam     140
5    Ram     160
After one-hot encoding, it becomes:
   values  0  1  2  3  4
0     100  1  0  0  0  0
1     110  0  1  0  0  0
2     120  0  0  1  0  0
3     130  0  0  0  1  0
4     140  0  0  1  0  0
5     160  0  0  0  0  1
Now suppose I have test data in production that contains Danny, a new level in name:
    name  values
0  Shane     200
1  Danny     210
2    Sam     220
3   Tony     180
4  Danny     150
After one-hot encoding, this becomes:
   values  0  1  2  3
0     200  1  0  0  0
1     210  0  1  0  0
2     220  0  0  1  0
3     180  0  0  0  1
4     150  0  1  0  0
I have a question based on the situation above: Tony was feature 0 in the training set, but in the test set it is feature 3. Does this affect the trained model's predictions on the test input?
Upvotes: 3
Views: 912
Reputation: 2042
OneHotEncoder has a parameter for exactly this issue: handle_unknown.
handle_unknown{‘error’, ‘ignore’}, default=’error’ Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
As the quoted documentation shows, this parameter accepts two values. If new categories can appear in your test data (like Danny in your example), I recommend setting it to 'ignore':
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
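The encoder always keeps the output size it learned from the data it was fitted on. For example, if you fit your OneHotEncoder on your training data, the name column will always produce 5 one-hot columns (one per distinct training name), so together with values the model always receives 6 inputs.
Those columns also always correspond to the same categories from the training data. In your numbering, feature 0 will always refer to Tony, feature 1 to Smith, and so on. (scikit-learn itself sorts the categories, so the actual column order may differ from your table; you can check it in enc.categories_. What matters is that the order is fixed at fit time.)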
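To make the fixed column layout concrete, here is a minimal sketch using the names from the question. The variable names (train_names, test_names, test_encoded) are only illustrative, and the name column is encoded on its own, without values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Name columns from the question, shaped (n_samples, 1) as the encoder expects.
train_names = np.array([["Tony"], ["Smith"], ["Sam"], ["Shane"], ["Sam"], ["Ram"]])
test_names = np.array([["Shane"], ["Danny"], ["Sam"], ["Tony"], ["Danny"]])

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_names)  # the categories and their column order are fixed here

# The learned categories; scikit-learn stores them sorted.
categories = [str(c) for c in enc.categories_[0]]

# transform returns a sparse matrix by default, so convert it for inspection.
test_encoded = enc.transform(test_names).toarray()

print(categories)
print(test_encoded)
```

The test output has one column per training category, in the same order as in training, and the two Danny rows contain only zeros.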
If you want to reuse the fitted OneHotEncoder in another script, you can save it with the joblib library. For example:
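import joblib
from sklearn.preprocessing import OneHotEncoder

# data: the training column(s) to encode, shape (n_samples, n_features)
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(data)
joblib.dump(enc, 'encoder.joblib')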
Then load it from the other script:
import joblib
enc = joblib.load('encoder.joblib')
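As a rough end-to-end check of the save/load step, the sketch below writes the fitted encoder to a temporary directory (a stand-in for wherever your two scripts share files) and confirms that the reloaded encoder has the same categories. The names train_names, loaded and new_row are illustrative only:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_names = np.array([["Tony"], ["Smith"], ["Sam"], ["Shane"], ["Sam"], ["Ram"]])
enc = OneHotEncoder(handle_unknown="ignore").fit(train_names)

# The temporary directory plays the role of the shared location.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "encoder.joblib")
    joblib.dump(enc, path)      # what the training script would do
    loaded = joblib.load(path)  # what the production script would do

# The reloaded encoder carries the categories learned during fit.
same_categories = list(loaded.categories_[0]) == list(enc.categories_[0])

# An unseen name is encoded as a row of zeros because of handle_unknown="ignore".
new_row = loaded.transform(np.array([["Danny"]])).toarray()
print(same_categories, new_row)
```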
Finally, I would like to clarify how one-hot encoding works, because I think it is not entirely clear:
First, you need to fit the encoder to one dataset (almost always the training data). What happens in this step? Basically, you are telling the encoder how many categories there are, which ones they are, and in what order (in your case, 5 categories: Tony, Smith, Sam, Shane, Ram).
Then you can transform any data with this fitted OneHotEncoder by calling transform. For example, the results for your test data would be:
Shane: even though it is the first name that appears in your test data, it keeps the column it was assigned during fit on the training data (feature 3 in your numbering), so a 1 appears in that column and zeros in all the others.
Danny: it will not have a 1 in any column, since this name did not appear in the training data. As mentioned above, if you set handle_unknown to 'error', transform will raise an error; if you set it to 'ignore', processing continues and every one-hot column for that row is 0.
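For comparison, here is a small sketch of the 'error' setting on the same unseen name. It assumes the same training names as in the question; strict_enc and raised are illustrative names:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_names = np.array([["Tony"], ["Smith"], ["Sam"], ["Shane"], ["Sam"], ["Ram"]])

# With handle_unknown="error" (the default), unseen categories are rejected.
strict_enc = OneHotEncoder(handle_unknown="error").fit(train_names)

try:
    strict_enc.transform(np.array([["Danny"]]))
    raised = False
except ValueError as exc:
    # scikit-learn reports the unknown category in the error message.
    raised = True
    print("transform failed:", exc)
```

Failing loudly like this can be useful when a new category should never occur and you would rather stop than predict silently on an all-zero row.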
So, as you can see, you first fit the encoder on one dataset and then apply what it learned to transform other data. You should fit the OneHotEncoder only once.
Note: you can fit and transform the training data in a single step with fit_transform.
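As a quick sanity check, the sketch below compares the one-step and two-step versions on the training names from the question (the variable names are illustrative). For test data you would still call only transform on the already fitted encoder:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_names = np.array([["Tony"], ["Smith"], ["Sam"], ["Shane"], ["Sam"], ["Ram"]])

# One step: learn the categories and encode the training data together.
one_step = OneHotEncoder(handle_unknown="ignore").fit_transform(train_names).toarray()

# Two steps: fit first, then transform the same data.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_names)
two_step = enc.transform(train_names).toarray()

identical = np.array_equal(one_step, two_step)
print(identical)
```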
Upvotes: 3