Aditya Sharma

Reputation: 157

Using get_dummies() and OneHotEncoder on a large number of categorical variables

In most academic examples, categorical features are converted using get_dummies() or OneHotEncoder(). Say I want to use country as a feature, and the dataset contains 100 unique countries. When we apply get_dummies() or OneHotEncoder() to the country column, we get 100 columns, and the model is trained on those 100 country columns plus the other features.

Now suppose we deploy this model to production and the incoming data contains only 10 countries. If we pre-process that data with get_dummies() or OneHotEncoder(), the model fails to predict with an error along the lines of "number of features the model was trained on does not match the features passed", because we are now passing only 10 country columns plus the other features.

Can you please help me understand how to handle such scenarios, and how a large number of categorical variables across multiple columns can be pre-processed during model building?

Upvotes: 1

Views: 2449

Answers (1)

Viktoriya Malyasova

Reputation: 1425

The pandas.get_dummies() function indeed should not be used in deployment, for the reason you described. Scikit-learn's OneHotEncoder, though, handles this situation just fine:

from sklearn import preprocessing
import pandas as pd

# Fit the encoder on the training data only; it remembers all four categories.
ohe = preprocessing.OneHotEncoder(handle_unknown='ignore')
X_train = pd.DataFrame({'country': ['USA', 'Russia', 'China', 'Spain']})
X_test = pd.DataFrame({'country': ['Russia', 'Ukraine', 'China', 'Russia']})
ohe.fit(X_train)

# transform() always emits one column per training category (sorted:
# China, Russia, Spain, USA); unseen labels become all-zero rows.
ohe.transform(X_test).toarray()

array([[0., 1., 0., 0.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])

(Here I have set handle_unknown='ignore' so that new labels ('Ukraine') are encoded as all zeros. With handle_unknown='error', which is the default, new labels would raise an error instead.) So the OneHotEncoder can handle a different set of labels in the test set.

Upvotes: 2
