Reputation: 157
In most academic examples, categorical features are converted using get_dummies()
or OneHotEncoder()
. Let's say I want to use Country as a feature and the dataset has 100 unique countries. When we apply get_dummies()
or OneHotEncoder()
to country, we get 100 columns, and the model is trained with 100 country columns + other features.
Let's say we have deployed this model into production and we receive only 10 countries. When we pre-process the data using get_dummies()
or OneHotEncoder()
, the model will fail to predict with "Number of features the model was trained on does not match the features passed", because we are passing 10 country columns + other features.
Can you please help me understand how to handle such scenarios? How can a large number of categorical values across multiple columns be pre-processed during model building?
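To make the mismatch concrete, here is a minimal sketch of the problem described above (the column name 'country' and the label sets are illustrative): get_dummies() builds columns only from the labels present in the data it sees, so the training and production frames end up with different widths.

```python
# get_dummies() derives its columns from the labels it sees, so the
# training frame and the production frame get different shapes.
import pandas as pd

train = pd.DataFrame({'country': ['USA', 'Russia', 'China', 'Spain']})
prod = pd.DataFrame({'country': ['Russia', 'China']})  # only a subset arrives

train_enc = pd.get_dummies(train)
prod_enc = pd.get_dummies(prod)

print(train_enc.shape[1])  # 4 country columns at training time
print(prod_enc.shape[1])   # 2 country columns -> model input mismatch
```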
Upvotes: 1
Views: 2449
Reputation: 1425
The pandas.get_dummies()
function indeed should not be used in deployment, for the reason you described. scikit-learn's OneHotEncoder, though, handles this situation just fine:
from sklearn import preprocessing
import pandas as pd
ohe = preprocessing.OneHotEncoder(handle_unknown='ignore')
X_train = pd.DataFrame({'country':['USA', 'Russia', 'China', 'Spain']})
X_test = pd.DataFrame({'country':['Russia', 'Ukraine', 'China', 'Russia']})
ohe.fit(X_train)
ohe.transform(X_test).toarray()
array([[0., 1., 0., 0.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])
(Here I have set handle_unknown='ignore'
so that unseen labels ('Ukraine') are encoded as all zeros. If you set handle_unknown='error'
(which is the default), unseen labels will raise an error.) So, OneHotEncoder can handle a different set of labels in the test set.
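The key point for deployment is that you fit the encoder once on the training data and reuse that same fitted object at prediction time, so the column layout always matches what the model saw. A minimal sketch (the data and the idea of persisting with joblib are illustrative, not from the question):

```python
# Fit the encoder once on training data; reuse the fitted object in
# production so the output width stays fixed at one column per
# training-time category, regardless of which labels arrive later.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'country': ['USA', 'Russia', 'China', 'Spain']})
X_prod = pd.DataFrame({'country': ['Russia', 'Ukraine']})  # unseen label

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train)

# In a real deployment you would persist and reload the fitted encoder,
# e.g. joblib.dump(ohe, 'ohe.joblib') at training time and
# ohe = joblib.load('ohe.joblib') in the serving code.
encoded = ohe.transform(X_prod).toarray()
print(encoded.shape)  # (2, 4): always one column per training category
```

Because the fitted encoder remembers its categories, the unseen 'Ukraine' row simply becomes all zeros instead of changing the feature count.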
Upvotes: 2