Reputation: 1
I have built a classifier and would like to improve its accuracy beyond its current 73%.
I started to incorporate feature selection using chi-square, but how do I get the selected features back into the training data to build the classifier?
If I were to go through each training example and keep only the terms that appear in the selected feature list, would that be correct?
Also, do I need to do the same for the test set, which contains unseen examples?
Any advice will be much appreciated.
Upvotes: 0
Views: 374
Reputation: 2600
It's worth making a small addition to blue_note's answer.
To prevent overfitting and make sure your model generalizes, you should evaluate your feature selection strategy on a separate development set. The intuition is: if you try a large number of different models (i.e. classifiers trained on different feature subsets), it's likely that some will perform better than others on the training set just by chance. To be sure one particular model is really better than the others, you need to test it on a different set, with examples not seen during training.
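To make that concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy data, the helper names (`nearest_neighbor_accuracy`, `pick_subset_on_dev`) and the 1-nearest-neighbour classifier standing in for whatever classifier you actually use. Each candidate feature subset is scored on the development set, and the test set is touched only once, after the choice is made.

```python
def nearest_neighbor_accuracy(train_X, train_y, eval_X, eval_y, cols):
    """Accuracy of a 1-nearest-neighbour classifier that sees only columns `cols`."""
    def project(row):
        return [row[c] for c in cols]

    correct = 0
    for x, y in zip(eval_X, eval_y):
        px = project(x)
        # Index of the closest training example (squared Euclidean distance).
        nearest = min(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(project(train_X[i]), px)),
        )
        correct += train_y[nearest] == y
    return correct / len(eval_y)


def pick_subset_on_dev(train_X, train_y, dev_X, dev_y, candidates):
    """Return (dev accuracy, columns) for the candidate subset that scores best on dev."""
    scored = [
        (nearest_neighbor_accuracy(train_X, train_y, dev_X, dev_y, cols), cols)
        for cols in candidates
    ]
    return max(scored, key=lambda s: s[0])


# Toy data (hypothetical): column 0 is informative, column 1 is noise.
train_X = [[0.0, 5.0], [0.1, -4.0], [1.0, 4.5], [0.9, -5.0]]
train_y = [0, 0, 1, 1]
dev_X = [[0.2, -4.8], [0.8, 5.2]]
dev_y = [0, 1]
test_X = [[0.05, 4.0], [0.95, -4.2]]
test_y = [0, 1]

# Compare candidate subsets on the development set only.
best_acc, best_cols = pick_subset_on_dev(
    train_X, train_y, dev_X, dev_y, candidates=[(0,), (1,), (0, 1)]
)

# Evaluate the chosen subset on the test set once, at the very end.
test_acc = nearest_neighbor_accuracy(train_X, train_y, test_X, test_y, best_cols)
```

With more data you would normally use cross-validation on the training set instead of a single development split, but the principle is the same: the examples used to choose the subset are not the ones used to report the final score.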
Upvotes: 0
Reputation: 29071
Simply put, feature selection essentially says (for example): "Of the 5 attributes of the input vector, only features 1, 3 and 4 are useful. Features 2 and 5 are junk. Don't use them at all." This goes for both the training and the test patterns, since they come from the same distribution. So you drop features 2 and 5 from both the training and the test patterns, and then you train and test your classifier in the usual way.
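For the text case in the question, this could look roughly like the sketch below. It is only an illustration under stated assumptions: documents are lists of tokens, a term counts as present or absent per document, each term is scored with the standard 2x2 chi-square statistic against each class (taking the maximum over classes), and the helper names and toy data are made up. The important part is that the term list is computed from the training data only and then applied unchanged to both sets.

```python
from collections import Counter


def chi2_scores(docs, labels):
    """Chi-square score per term, from term presence/absence vs. class membership."""
    n = len(docs)
    class_counts = Counter(labels)
    term_class = Counter()  # (term, class) -> number of docs of that class containing term
    term_total = Counter()  # term -> number of docs containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_class[(term, label)] += 1
            term_total[term] += 1

    scores = {}
    for term, df in term_total.items():
        best = 0.0
        for cls, n_cls in class_counts.items():
            a = term_class[(term, cls)]  # in class, contains term
            b = df - a                   # not in class, contains term
            c = n_cls - a                # in class, lacks term
            d = n - df - c               # not in class, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_terms(docs, labels, k):
    """Top-k terms by chi-square score (ties broken alphabetically)."""
    scores = chi2_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def keep_selected(docs, selected):
    """Drop every token that is not in the selected term set."""
    return [[t for t in tokens if t in selected] for tokens in docs]


# Toy data (hypothetical).
train_docs = [
    ["cheap", "pills", "now"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "now"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]
test_docs = [["cheap", "meeting", "tomorrow"]]

# Selection is fitted on the training data only...
selected = select_terms(train_docs, train_labels, k=2)

# ...and the same term list is then applied to both sets.
train_reduced = keep_selected(train_docs, selected)
test_reduced = keep_selected(test_docs, selected)
```

If you are using scikit-learn, the same pattern is to fit `SelectKBest(chi2, k=...)` on the training matrix and then call `transform` on both the training and the test matrices. Note that its `chi2` scores are computed from feature values rather than from the presence/absence table used in this sketch, so the rankings may differ.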
More generally, the point of feature extraction (which is a superset of feature selection) is to transform the original input vector into a different input vector that is more suitable for classification. You transform both the training and the test patterns to the new form, essentially creating a new problem from the original one. Note that the new values may or may not appear in the original pattern (they may be produced by applying some function to a combination of values from the original pattern). Then you use the new, transformed problem to both train and test the classifier.
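A tiny sketch of that idea, with a made-up transform and hypothetical names: the extractor learns whatever it needs (here, a mean) from the training patterns, and the resulting function is applied to both the training and the test patterns. One new value is a shifted copy of an original value; the other is a combination of two original values.

```python
def fit_extractor(train_X):
    """Learn the training-set mean of column 0 and return a transform function."""
    mean0 = sum(row[0] for row in train_X) / len(train_X)

    def extract(row):
        # First output: an original value, centred using the training mean.
        # Second output: a derived value that is not in the original pattern.
        return [row[0] - mean0, row[1] * row[2]]

    return extract


# Toy data (hypothetical).
train_X = [[1.0, 2.0, 3.0], [3.0, 0.5, 4.0]]
test_X = [[2.0, 1.0, 1.0]]

extract = fit_extractor(train_X)              # fitted on training data only
train_new = [extract(row) for row in train_X]
test_new = [extract(row) for row in test_X]   # same transform, unseen data
```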
Upvotes: 1