Reputation: 1675
How do I test a text classification problem with unknown words? When training a model, we can use a smoothing technique (Laplace add-1) to make sure every word receives a count of at least 1 for each class.
But what about the testing stage? If a word doesn't occur in the training data, what's the best way to deal with it? Simply skip it, or also give it a count of 1?
Thanks for any suggestions or opinions. Specifically, I am using a Naive Bayes classifier.
Upvotes: 2
Views: 3263
Reputation: 5971
When you come to classify an instance, think about what's going on. If you apply add-1 smoothing to an unseen feature, you simply multiply your accumulated score for each class by a very small probability, 1 / (N_c + vocabSize), where N_c is the total word count for class c (or you add the log of that probability if you work in log space). If you skip the unseen feature, nothing happens to the scores.
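So, generally speaking, an unseen feature in your test data shouldn't make a difference to your classification decision: you know nothing about it, as you haven't seen it in training. With smoothing you multiply every class score by a small probability (or add a small log-probability), and that value is identical across classes whenever the classes have the same total word count. With skipping you simply ignore the feature for all of your class scores. If the class word counts differ a lot, the smoothed value slightly favors the class with fewer words, so skipping is the neutral option.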
If you're not convinced, simply try both and see if it makes any difference.
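To try both, here is a minimal sketch of a multinomial Naive Bayes scorer with add-1 smoothing, written from scratch. The function names (`train`, `score`, `classify`), the `unseen` parameter, and the toy data are all made up for illustration and are not from any library. It scores the same test document twice, once skipping out-of-vocabulary words and once giving them the smoothed probability, and compares the margin between the two classes.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label). Returns doc counts, word counts, vocabulary."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def score(tokens, class_docs, word_counts, vocab, unseen="skip"):
    """Log-score per class. unseen="skip" ignores out-of-vocabulary words;
    unseen="smooth" gives them the add-1 probability 1 / (N_c + |V|)."""
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # N_c + |V|
        s = math.log(n_docs / total_docs)          # log prior
        for tok in tokens:
            if tok not in vocab and unseen == "skip":
                continue
            # A Counter returns 0 for a missing key, so unseen words get 1 / denom.
            s += math.log((counts[tok] + 1) / denom)
        scores[label] = s
    return scores

def classify(scores):
    return max(scores, key=scores.get)

# Toy training set (hypothetical); both classes have the same number of tokens.
train_docs = [
    ("good great fun".split(), "pos"),
    ("bad awful boring".split(), "neg"),
]
model = train(train_docs)

test_tokens = "good zzz".split()  # "zzz" never appears in training
skip_scores = score(test_tokens, *model, unseen="skip")
smooth_scores = score(test_tokens, *model, unseen="smooth")

margin_skip = skip_scores["pos"] - skip_scores["neg"]
margin_smooth = smooth_scores["pos"] - smooth_scores["neg"]
print(classify(skip_scores), classify(smooth_scores))
print(margin_skip, margin_smooth)
```

In this toy setup both classes have the same token total, so the unseen word shifts both scores by the same amount and the margin is unchanged. On real data with unbalanced classes, `denom` differs per class, so it is worth running both variants on a held-out set.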
Upvotes: 3