Reputation: 327
I'm working on some text-mining tasks and have a simple question that I still can't settle.
I apply pre-processing, such as tokenization and stemming, to my training set so I can train my model.
Should I also apply this pre-processing to my test set?
Upvotes: 0
Views: 244
Reputation: 81
Of course you should. Otherwise, how would you feed your test data into your trained model?
Upvotes: 0
Reputation: 995
Yes, you should apply the same steps to your test set. Your test set must be representative of your training set, which is why both should come from the same distribution. Let's think about it intuitively:
Suppose you are going to take an exam. For you to prepare and get a reasonable result, the lecturer should ask about the same subjects that were covered in the lectures. If the lecturer instead asks about totally different subjects that no one has seen, it is not possible to get a reasonable result.
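To make this concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: `preprocess`, `build_vocabulary`, and `vectorize` are hypothetical helpers, and the suffix-stripping rule is only a toy stand-in for a real stemmer. It is not the asker's actual pipeline, but it shows what tends to go wrong when the test text skips the training-time pre-processing.

```python
import re

def preprocess(text):
    """Lowercase, tokenize, and crudely stem one document.

    The suffix stripping is a toy stand-in for a real stemmer (assumption).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = []
    for tok in tokens:
        for suffix in ("ing", "ed", "s"):
            # Strip at most one suffix, and only from longer words.
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed

def build_vocabulary(train_docs):
    """Map each pre-processed training token to a column index."""
    vocab = {}
    for doc in train_docs:
        for tok in preprocess(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(doc, vocab, use_preprocessing=True):
    """Turn a document into bag-of-words counts over the training vocabulary."""
    tokens = preprocess(doc) if use_preprocessing else doc.split()
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

train_docs = ["The cats jumped", "Dogs are barking"]
test_doc = "Cats barked"

# The vocabulary is built from the training set only.
vocab = build_vocabulary(train_docs)

# Same pre-processing as training: "Cats" -> "cat", "barked" -> "bark".
with_prep = vectorize(test_doc, vocab, use_preprocessing=True)

# Raw test text: "Cats" and "barked" never match the stemmed vocabulary.
without_prep = vectorize(test_doc, vocab, use_preprocessing=False)

print(with_prep)
print(without_prep)
```

With the shared `preprocess` step the test document maps onto features the model has seen; without it, the vector is all zeros, so the model has nothing to work with. Note that the vocabulary itself is still learned from the training set only; the test set just goes through the same transformation.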
Upvotes: 1