Should text-mining preprocessing be applied to the test set or only to the training set?

Question

I'm working on some text-mining tasks and have a simple question that I still can't settle.

I am applying preprocessing, such as tokenization and stemming, to my training set so I can train my model.

Should I also apply this pre-processing to my test set?

berkayln · Accepted Answer

Yes, you should apply the same preprocessing to your test set. The test set must be representative of the training set, so both should come from the same distribution. A model trained on stemmed tokens will not recognize the unstemmed words in raw test text. Intuitively:

Suppose you are about to take an exam. To prepare properly and get a fair result, you need the lecturer to ask about the subjects covered in the lectures. If the lecturer instead asks about completely different subjects that no one has seen, a fair result is not possible.
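To make this concrete, here is a minimal sketch in plain Python. The names `preprocess`, `fit_vocabulary`, and `vectorize` are hypothetical helpers made up for this illustration, and the suffix-stripping "stemmer" is a toy stand-in (in practice you would probably use a real one, such as NLTK's Porter stemmer). The point is only that one and the same function is applied to both sets, while the vocabulary is built from the training set alone.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase, tokenize on letters, and strip a few suffixes (toy stemmer)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = []
    for tok in tokens:
        for suffix in ("ing", "ed", "s"):
            # Only strip when at least three characters remain.
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems

def fit_vocabulary(train_docs):
    """Build a term -> column index mapping from the training documents only."""
    vocab = {}
    for doc in train_docs:
        for tok in preprocess(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(doc, vocab):
    """Count-vectorize a document with the SAME preprocessing used in training."""
    counts = Counter(preprocess(doc))
    return [counts.get(term, 0) for term in vocab]

train_docs = ["The cats were jumping", "Dogs jumped over walls"]
test_docs = ["A cat jumps"]

vocab = fit_vocabulary(train_docs)                   # learned from train only
X_train = [vectorize(d, vocab) for d in train_docs]
X_test = [vectorize(d, vocab) for d in test_docs]    # same preprocess() reused
```

Without stemming the test sentence, the token "jumps" would not match the "jump" entry learned from the training set, and the model would effectively see an unknown word.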
