Nikhil Mishra

Reputation: 1250

How do I resolve one-hot encoding if my test data has missing values in a column?

For example, if my training data has the categorical values (1, 2, 3, 4, 5) in the column, then one-hot encoding will give me 5 columns. But in my test data I have, say, only 4 of the 5 values, i.e. (1, 3, 4, 5), so one-hot encoding will give me only 4 columns. Therefore, if I apply my trained weights to the test data, I will get an error because the column dimensions of the train and test data do not match: dim(4) != dim(5). Any suggestions on what to do about the missing column values? An image of my code is provided below:

[image of code]
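A minimal sketch reproducing the mismatch described above, assuming each split is encoded independently with pd.get_dummies:

```python
import pandas as pd

# Training column contains all 5 levels; the test column is missing level 2.
train = pd.DataFrame({'col': [1, 2, 3, 4, 5]})
test = pd.DataFrame({'col': [1, 3, 4, 5]})

train_enc = pd.get_dummies(train, columns=['col'])
test_enc = pd.get_dummies(test, columns=['col'])

# Encoding each split on its own yields mismatched widths:
print(train_enc.shape[1])  # 5 columns
print(test_enc.shape[1])   # 4 columns -- weights trained on 5 will not apply
```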

Upvotes: 10

Views: 7502

Answers (3)

Basil C Sunny

Reputation: 424

Use dummy (binary) encoding instead of one-hot encoding. Pandas' pd.get_dummies() with drop_first=True creates a dummy encoding, producing k-1 dummies from k categorical levels by removing the first level. The default, drop_first=False, produces a one-hot encoding.

See the official pandas documentation.

Also, dummy (binary) encoding creates fewer columns.
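A minimal sketch of the difference, using the 5-level column from the question:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5]})

# One-hot encoding: k columns (the default, drop_first=False)
one_hot = pd.get_dummies(df, columns=['col'])

# Dummy encoding: k-1 columns (the dropped first level is implied by all zeros)
dummy = pd.get_dummies(df, columns=['col'], drop_first=True)

print(one_hot.shape[1])  # 5
print(dummy.shape[1])    # 4
```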

Upvotes: 0

Bharath M Shetty

Reputation: 30605

You can first combine the two dataframes, then call get_dummies, then split them back so that they have the exact same number of columns, i.e.

import numpy as np
import pandas as pd

# Example dataframes
Xtrain = pd.DataFrame({'x': np.array([4, 2, 3, 5, 3, 1])})
Xtest = pd.DataFrame({'x': np.array([4, 5, 1, 3])})

# Concat with keys, then get dummies
temp = pd.get_dummies(pd.concat([Xtrain, Xtest], keys=[0, 1]), columns=['x'])

# Select the train and test rows back out of the MultiIndex
Xtrain, Xtest = temp.xs(0), temp.xs(1)

# Xtrain.to_numpy()  (.as_matrix() in older pandas)
# array([[0, 0, 0, 1, 0],
#        [0, 1, 0, 0, 0],
#        [0, 0, 1, 0, 0],
#        [0, 0, 0, 0, 1],
#        [0, 0, 1, 0, 0],
#        [1, 0, 0, 0, 0]], dtype=uint8)

# Xtest.to_numpy()
# array([[0, 0, 0, 1, 0],
#        [0, 0, 0, 0, 1],
#        [1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0]], dtype=uint8)

Do not follow this approach. It is a simple trick with a lot of disadvantages; @Vast Academician's answer explains this better.

Upvotes: 5

Vast Academician

Reputation: 357

Please, don't make this mistake!

Yes, you can do this hack by concatenating train and test and fool yourself, but the real problem arises in production: there your model will someday face an unknown level of your categorical variable and break.

In reality, some of the more viable options could be:

  1. Retrain your model periodically to account for new data.
  2. Do not use one-hot. Seriously, there are many better options, like leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few. Some classifiers, like CatBoost, even have a built-in mechanism for encoding, and there are mature libraries like category_encoders in Python, where you will find lots of other options.
  3. Embed categorical features; this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/).
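One simple way to keep train and test widths aligned without concatenating the splits is to fix the category set from the training data before encoding. A minimal sketch, assuming pandas' Categorical dtype (levels absent from a split become all-zero columns, and unseen production levels encode as all zeros instead of crashing):

```python
import pandas as pd

train = pd.DataFrame({'col': [1, 2, 3, 4, 5]})
test = pd.DataFrame({'col': [1, 3, 4, 5]})  # level 2 is missing

# Freeze the category set learned from training data
categories = sorted(train['col'].unique())
train['col'] = pd.Categorical(train['col'], categories=categories)
test['col'] = pd.Categorical(test['col'], categories=categories)

# get_dummies emits one column per declared category, used or not
train_enc = pd.get_dummies(train, columns=['col'])
test_enc = pd.get_dummies(test, columns=['col'])

print(train_enc.shape[1])  # 5
print(test_enc.shape[1])   # 5 -- the missing level 2 becomes an all-zero column
```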

Upvotes: 17
