Reputation: 57
I have used Weka for text classification. First I applied the StringToWordVector filter, then used the filtered data with an SVM classifier (LibSVM) for cross-validation. Later I read a blog post here
It said that it is not suitable to apply the filter first and then perform cross-validation; instead it proposes using FilteredClassifier. The justification given is:
Two weeks ago, I wrote a post on how to chain filters and classifiers in WEKA, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using N Fold Cross Validation (CV) in your data, you should not apply the StringToWordVector (STWV) filter on the full data collection and then perform the CV evaluation on your data, because you would be using words that are present in your test subset (but not in your training subset) for each run.
I cannot understand the reason behind this. Does anyone know why?
Upvotes: 1
Views: 1427
Reputation: 1947
When you apply the filter before N-fold cross-validation, you filter every word that appears in every instance, regardless of whether that instance will later end up in a test fold or a training fold. At that point, the filter has no way of knowing whether an instance is a test instance or a training instance. So if you use StringToWordVector with TFTransform (or any similar operation), words in the test instances can affect the transformed values. (Put simply: if you are building a bag of words, you are taking the test instances into consideration too.) This is not acceptable, because the training parameters should not be affected by the test data. Instead, you can do the filtering on the fly, inside each fold; that is what FilteredClassifier does.
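Here is a minimal Python sketch (not Weka code; the document names, folds, and words below are made up for illustration) of the leakage the answer describes. It builds a bag-of-words vocabulary two ways: once on the full collection, which is what applying StringToWordVector before cross-validation amounts to, and once on the training fold only, which is what FilteredClassifier effectively does per fold.

```python
# Illustrative example of vocabulary leakage when filtering before CV.

docs = [
    "spam cheap pills",      # fold 1 (training fold)
    "meeting at noon",       # fold 1 (training fold)
    "cheap spam offer",      # fold 2 (held-out test fold)
    "lunch meeting today",   # fold 2 (held-out test fold)
]

def vocabulary(documents):
    """Collect the set of distinct words across the given documents."""
    vocab = set()
    for d in documents:
        vocab.update(d.split())
    return vocab

# Wrong: vocabulary built on ALL documents, test fold included.
# This is what happens when the filter runs before cross-validation.
leaky_vocab = vocabulary(docs)

# Right: vocabulary built on the training fold only, as FilteredClassifier
# would do when it re-applies the filter inside each fold.
train_fold = docs[:2]
clean_vocab = vocabulary(train_fold)

# These words exist in the feature space only because the test fold was
# visible when the filter ran - information the classifier should never
# have had at training time.
leaked_words = leaky_vocab - clean_vocab
print(sorted(leaked_words))  # ['lunch', 'offer', 'today']
```

With a TF or IDF transform the effect is subtler but the same: term statistics computed over the full collection already encode the test documents, so every fold's "training" features are contaminated.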
To get an idea of how N-fold cross-validation works, please refer to Rushdi Shams's answer to the following question. Please let me know whether that clears it up. Cheers..!!
Upvotes: 1