Reputation: 1001
I'm working with a multi-class text classification dataset that has train and test sets. There are around 470 unique labels in the training set and around 250 unique labels in the test set. (These 470 + 250 unique labels come from a large label set of size 4 million.)
There are around 30 labels that appear only in the test set and not in the training set.
Do I need to encode each label as a one-hot vector of size 4 million rather than 450, so that I can also handle those 30 missing labels?
Upvotes: 3
Views: 2709
Reputation: 7432
There is no way your model can learn labels it hasn't seen! Ideally, in machine learning you assume that the training set and the test set are sampled from the same underlying distribution. The model can only learn what you teach it, so you need to make sure that you train and test it on similar data!
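To make this concrete, here is a minimal sketch (plain Python, toy labels that are purely hypothetical) of how the label space is usually derived from the training set alone. The helper names `build_label_index` and `split_seen_unseen` are made up for illustration. The point is that the size of the one-hot vector follows from the labels seen in training; making it 4 million wide would only add positions that never receive a positive example.

```python
def build_label_index(train_labels):
    """Map each distinct training label to a contiguous integer id."""
    return {label: i for i, label in enumerate(sorted(set(train_labels)))}


def split_seen_unseen(test_labels, label_index):
    """Separate test labels into those covered by training and those that are not."""
    seen = [y for y in test_labels if y in label_index]
    unseen = sorted({y for y in test_labels if y not in label_index})
    return seen, unseen


# Toy data standing in for the real label columns (hypothetical values).
train_labels = ["sports", "politics", "tech", "sports", "tech"]
test_labels = ["tech", "health", "sports", "finance"]

label_index = build_label_index(train_labels)
seen, unseen = split_seen_unseen(test_labels, label_index)

print(len(label_index))  # width of the one-hot vector / output layer
print(unseen)            # test labels the model has no way to predict
```

On your data, `len(label_index)` would come out at roughly 470 and `unseen` would list the roughly 30 test-only labels, which you can then drop from evaluation or move into training.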
You could try merging your two sets and then re-splitting them into a training set and a test set so that both contain the same classes. Furthermore, make sure you have enough data. Your model can't learn a class it has seen only once or twice. For the model to learn 500 classes you should have hundreds of thousands of samples! If you don't, maybe try merging some of your classes together.
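The merge-and-re-split idea could look roughly like the sketch below. It is plain Python on toy `(text, label)` pairs; the function name `stratified_resplit`, the 20% test fraction, and the rule that single-sample classes stay in training are all assumptions for illustration, not the only reasonable choices.

```python
import random
from collections import defaultdict


def stratified_resplit(samples, test_fraction=0.2, seed=0):
    """Re-split (text, label) pairs so every test label also appears in train.

    Classes with a single sample go entirely to train, because they cannot be
    represented on both sides.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Take about test_fraction of the class, but always leave one in train.
        n_test = min(len(group) - 1, max(1, round(len(group) * test_fraction)))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy stand-ins for the original splits (hypothetical data).
old_train = [("t1", "a"), ("t2", "a"), ("t3", "b"), ("t4", "b"), ("t5", "b")]
old_test = [("t6", "a"), ("t7", "c"), ("t8", "c"), ("t9", "d")]

merged = old_train + old_test
train, test = stratified_resplit(merged)

train_classes = {label for _, label in train}
test_classes = {label for _, label in test}
print(test_classes <= train_classes)  # True: no test-only labels remain
```

If you use scikit-learn, `train_test_split` with its `stratify` argument does a similar per-class split, though it raises an error when a class has fewer than two samples, so very rare classes would need to be merged or removed first.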
Upvotes: 2