Reputation: 4668
I get contradictory results with two methods that, in my mind, should produce the same output. Could someone point out what the difference is? I can't get my head around it :S
I am working on Drexel_Stats.arff. I use a 1-NN classifier with 10-fold cross-validation. Without any preprocessing, this is the confusion matrix I get:
a b <-- classified as
14 3 | a = Win
5 1 | b = Loss
To get better results, I used:
weka.attributeSelection.InfoGainAttributeEval
weka.attributeSelection.Ranker -T -1.0 -N 5
to select the 5 most discriminating features of the data set. Then I manually removed all the other features, re-ran my 1-NN, and got these results:
a b <-- classified as
16 1 | a = Win
1 5 | b = Loss
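For reference, I believe this manual procedure corresponds roughly to the following Java code (a sketch on my part, since I actually used the GUI; the random seed is arbitrary):

import java.util.Random;

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IB1;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GlobalSelectionThenCV {
    public static void main(String[] args) throws Exception {
        // Load the full data set; the class attribute (Outcome) is the last one.
        Instances data = DataSource.read("Drexel_Stats.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Rank all attributes by information gain on ALL the data, keep the top 5.
        AttributeSelection selector = new AttributeSelection();
        Ranker ranker = new Ranker();
        ranker.setThreshold(-1.0);  // same options as in the GUI: -T -1.0 -N 5
        ranker.setNumToSelect(5);
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Drop every attribute that was not selected (what I did by hand),
        // then cross-validate 1-NN on the reduced data.
        Instances reduced = selector.reduceDimensionality(data);
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new IB1(), reduced, 10, new Random(1)); // arbitrary seed
        System.out.println(eval.toMatrixString());
    }
}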
Now, this is where it gets confusing (to me, at least). To save the hassle of manually discarding features, I tried to use a FilteredClassifier (a meta classifier). Here is what I used (copied from the GUI):
weka.classifiers.meta.FilteredClassifier
-F "weka.filters.supervised.attribute.AttributeSelection
-E \"weka.attributeSelection.InfoGainAttributeEval \"
-S \"weka.attributeSelection.Ranker -T -1.0 -N 5\""
-W weka.classifiers.lazy.IB1 -D
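For reference, I believe this configuration corresponds roughly to the following Java setup (again a sketch assuming the standard Weka API; the random seed is arbitrary):

import java.util.Random;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IB1;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.attribute.AttributeSelection;

public class FilteredClassifierCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Drexel_Stats.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Note: this is the AttributeSelection *filter*,
        // not the weka.attributeSelection.AttributeSelection class.
        AttributeSelection filter = new AttributeSelection();
        Ranker ranker = new Ranker();
        ranker.setThreshold(-1.0);
        ranker.setNumToSelect(5);
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);

        // The filter is rebuilt from each training fold before IB1 is trained,
        // so the test fold never influences the feature selection.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(filter);
        fc.setClassifier(new IB1());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1)); // arbitrary seed
        System.out.println(eval.toMatrixString());
    }
}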
I understand this to be an automation of the previous operation, but the results I get this time are different:
a b <-- classified as
15 2 | a = Win
4 2 | b = Loss
What did I get wrong?
Thanks
Edit: here is part of the WEKA output:
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 39 Outcome):
Information Gain Ranking Filter
Ranked attributes:
0.828 1 Opponent
0.469 38 Opp_Steals
0.42 24 Opp_Field_Goal_Pct
0.331 15 Def_Rebounds
0.306 28 Opp_Free_Throws_Made
Selected attributes: 1,38,24,15,28 : 5
Header of reduced data:
@relation 'Basketball_Statistics-weka.filters.unsupervised.attribute.Remove-V-R1,38,24,15,28,39'
@attribute Opponent {Florida_Gulf_Coast,Vermont,Penn,Rider,Toledo,Saint_Joseph,Fairleigh_Dickinson,Villanova,Syracuse,Temple,George_Mason,Georgia_State,UNC_Wilmington,James_Madison,Hofstra,Old_Dominion,Northeastern,Delaware,VCU,Towson}
@attribute Opp_Steals numeric
@attribute Opp_Field_Goal_Pct numeric
@attribute Def_Rebounds numeric
@attribute Opp_Free_Throws_Made numeric
@attribute Outcome {Win,Loss}
@data
Are the same features selected at each fold of the cross-validation? Can different features be selected depending on how the instances are split?
Upvotes: 3
Views: 1496
Reputation: 189
your first ("global") feature selection was using all data points including all labels, ie. it had access to class info you would not have access to during cross-validation. Therefore your first approach is flawed resulting in a too good error estimate. Your second approach is correct. It performs worse because it most likely does not select the same five features for every one of the ten runs during cross-validation. hth Bernhard
Upvotes: 2