Reputation: 1250
For example, if my training data has the categorical values (1, 2, 3, 4, 5) in a column, then one-hot encoding will give me 5 columns. But in the test data I have, say, only 4 of the 5 values, i.e. (1, 3, 4, 5), so one-hot encoding will give me only 4 columns. Therefore, if I apply my trained weights to the test data, I will get an error because the number of columns does not match between the train and test data: dim(4) != dim(5). Any suggestions on what to do with the missing column values? The image of my code is provided below:
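The mismatch described above can be reproduced with a minimal sketch. This is not the asker's own code; the frame names and the column name `col` are hypothetical, and the values simply mirror the example in the question.

```python
import pandas as pd

# Hypothetical data mirroring the question: level 2 never appears in the test set.
train = pd.DataFrame({"col": [1, 2, 3, 4, 5]})
test = pd.DataFrame({"col": [1, 3, 4, 5]})

# Encoding each frame separately only creates columns for the levels it contains.
train_encoded = pd.get_dummies(train, columns=["col"])
test_encoded = pd.get_dummies(test, columns=["col"])

print(train_encoded.shape[1])  # 5
print(test_encoded.shape[1])   # 4

# The column that exists in train but is absent from test.
missing = train_encoded.columns.difference(test_encoded.columns)
print(list(missing))  # ['col_2']
```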
Upvotes: 10
Views: 7502
Reputation: 424
Use dummy (binary) encoding instead of one-hot encoding. pandas pd.get_dummies() with drop_first=True creates a dummy encoding, giving k-1 dummies out of k categorical levels by removing the first level. The default option, drop_first=False, creates a one-hot encoding.
See the official pandas documentation.
Dummy (binary) encoding also creates fewer columns.
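As a small illustration of the difference, the sketch below encodes a hypothetical column `x` with five levels both ways. The frame and column names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical column with k = 5 categorical levels.
levels = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

one_hot = pd.get_dummies(levels, columns=["x"])                 # k columns
dummy = pd.get_dummies(levels, columns=["x"], drop_first=True)  # k - 1 columns

print(list(one_hot.columns))  # ['x_1', 'x_2', 'x_3', 'x_4', 'x_5']
print(list(dummy.columns))    # ['x_2', 'x_3', 'x_4', 'x_5']
```

Note that the dropped level is whichever comes first in the frame being encoded, so the train and test frames need to contain the same set of levels for their columns to line up.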
Upvotes: 0
Reputation: 30605
You can first combine the two DataFrames, apply get_dummies, and then split them again so that both have exactly the same columns, i.e.
# Example DataFrames
import numpy as np
import pandas as pd
Xtrain = pd.DataFrame({'x': np.array([4, 2, 3, 5, 3, 1])})
Xtest = pd.DataFrame({'x': np.array([4, 5, 1, 3])})
# Concatenate with keys (0 = train, 1 = test), then get dummies on the combined frame.
# dtype='uint8' keeps the 0/1 output shown below (newer pandas defaults to bool).
temp = pd.get_dummies(pd.concat([Xtrain, Xtest], keys=[0, 1]), columns=['x'], dtype='uint8')
# Select each part back out of the MultiIndex and reassign
Xtrain, Xtest = temp.xs(0), temp.xs(1)
# Xtrain.to_numpy()  (as_matrix() was removed in pandas 1.0)
# array([[0, 0, 0, 1, 0],
#        [0, 1, 0, 0, 0],
#        [0, 0, 1, 0, 0],
#        [0, 0, 0, 0, 1],
#        [0, 0, 1, 0, 0],
#        [1, 0, 0, 0, 0]], dtype=uint8)
# Xtest.to_numpy()
# array([[0, 0, 0, 1, 0],
#        [0, 0, 0, 0, 1],
#        [1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0]], dtype=uint8)
Do not follow this approach. It's a simple trick with a lot of disadvantages. @Vast Academician's answer explains it better.
Upvotes: 5
Reputation: 357
Guys, please don't make this mistake!
Yes, you can use this hack of concatenating train and test and fool yourself, but the real problem is in production. There, your model will someday face an unknown level of your categorical variable and then break.
In reality, some of the more viable options could be:
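One option along these lines, sketched here as an assumption rather than as this answer's own recommendation, is to fix the column layout from the training data alone and reindex any new data to it. The frame names and the unseen level 9 are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"x": [4, 2, 3, 5, 3, 1]})
new_data = pd.DataFrame({"x": [4, 5, 1, 3, 9]})  # 9 was never seen in training

# Learn the column layout from the training data only.
train_encoded = pd.get_dummies(train, columns=["x"], dtype="uint8")
train_columns = train_encoded.columns

# Encode the new data, then force it into the training layout:
# columns absent from new_data (x_2) are added and filled with 0,
# and columns for unseen levels (x_9) are dropped.
new_encoded = (
    pd.get_dummies(new_data, columns=["x"], dtype="uint8")
    .reindex(columns=train_columns, fill_value=0)
)

print(list(new_encoded.columns))  # ['x_1', 'x_2', 'x_3', 'x_4', 'x_5']
```

With this layout the unseen level becomes an all-zero row instead of a shape error, so the trained weights can still be applied. Whether an all-zero row is an acceptable representation of "unknown" depends on the model.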
Upvotes: 17