Renaud

Reputation: 4668

Meta filtered classifier and manually filtered classifiers give different results

I get contradictory results with two methods that, in my mind, should produce the same output. Could someone point out what the differences are, because I can't get my head around it :S

I am working on Drexel_Stats.arff. I use a 1-NN classifier with 10-fold cross-validation. Without any preprocessing, this is the confusion matrix I get:

  a  b   <-- classified as
 14  3 |  a = Win
  5  1 |  b = Loss
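
For reference, a minimal sketch of this baseline run via the WEKA Java API (the class index and the random seed are my assumptions here, not something the GUI run pins down):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IB1;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Baseline {
        public static void main(String[] args) throws Exception {
            // Load the data set; assume the class attribute (Outcome) is last.
            Instances data = DataSource.read("Drexel_Stats.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // 1-NN with 10-fold cross-validation, no preprocessing.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new IB1(), data, 10, new Random(1));
            System.out.println(eval.toMatrixString());
        }
    }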

To get better results I used:

 weka.attributeSelection.InfoGainAttributeEval
 weka.attributeSelection.Ranker -T -1.0 -N 5

to get the 5 most discriminating features of the data set. Then I manually removed all the other features, re-ran my 1-NN, and got these results:

  a  b   <-- classified as
 16  1 |  a = Win
  1  5 |  b = Loss
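
That manual pruning can also be expressed with the AttributeSelection filter applied to the full data set, which is effectively what I did (a sketch reusing `data` from the baseline snippet; it runs inside a method that throws Exception):

    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.AttributeSelection;

    // "Global" selection: the evaluator sees every instance and every label,
    // and the reduced data set is cross-validated afterwards.
    AttributeSelection select = new AttributeSelection();
    select.setEvaluator(new InfoGainAttributeEval());
    Ranker ranker = new Ranker();
    ranker.setThreshold(-1.0);
    ranker.setNumToSelect(5);
    select.setSearch(ranker);
    select.setInputFormat(data);                    // `data` as loaded above
    Instances reduced = Filter.useFilter(data, select);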

Now that's where it gets confusing (to me at least). I tried to use a meta filtered classifier to save the hassle of manually discarding features. Here is what I used (copied from the GUI):

 weka.classifiers.meta.FilteredClassifier
 -F "weka.filters.supervised.attribute.AttributeSelection
 -E \"weka.attributeSelection.InfoGainAttributeEval \"
 -S \"weka.attributeSelection.Ranker -T -1.0 -N 5\""
 -W weka.classifiers.lazy.IB1 -D
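
If it helps, I believe that GUI configuration corresponds to roughly the following Java setup (a sketch reusing the names from the earlier snippets):

    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.classifiers.lazy.IB1;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.filters.supervised.attribute.AttributeSelection;

    // Same selector as before, but wrapped so it is re-trained on each fold's
    // training data only, never on the held-out test instances.
    AttributeSelection filter = new AttributeSelection();
    filter.setEvaluator(new InfoGainAttributeEval());
    Ranker ranker = new Ranker();
    ranker.setThreshold(-1.0);
    ranker.setNumToSelect(5);
    filter.setSearch(ranker);

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(filter);
    fc.setClassifier(new IB1());
    // Evaluate exactly as in the baseline snippet:
    // eval.crossValidateModel(fc, data, 10, new Random(1));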

I understand this as an automation of the previous operation, but the results I get this time are different:

  a  b   <-- classified as
 15  2 |  a = Win
  4  2 |  b = Loss

What did I get wrong?

Thanks

Edit: here is part of the WEKA output:

    === Attribute Selection on all input data ===

Search Method:
    Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 39 Outcome):
    Information Gain Ranking Filter

Ranked attributes:
 0.828    1 Opponent
 0.469   38 Opp_Steals
 0.42    24 Opp_Field_Goal_Pct
 0.331   15 Def_Rebounds
 0.306   28 Opp_Free_Throws_Made

Selected attributes: 1,38,24,15,28 : 5


Header of reduced data:
@relation 'Basketball_Statistics-weka.filters.unsupervised.attribute.Remove-V-R1,38,24,15,28,39'

@attribute Opponent {Florida_Gulf_Coast,Vermont,Penn,Rider,Toledo,Saint_Joseph,Fairleigh_Dickinson,Villanova,Syracuse,Temple,George_Mason,Georgia_State,UNC_Wilmington,James_Madison,Hofstra,Old_Dominion,Northeastern,Delaware,VCU,Towson}
@attribute Opp_Steals numeric
@attribute Opp_Field_Goal_Pct numeric
@attribute Def_Rebounds numeric
@attribute Opp_Free_Throws_Made numeric
@attribute Outcome {Win,Loss}

@data

Are the same features selected at each fold of the cross-validation? Can different features be selected depending on how the instances are split?

Upvotes: 3

Views: 1496

Answers (1)

user988621

Reputation: 189

your first ("global") feature selection was using all data points including all labels, ie. it had access to class info you would not have access to during cross-validation. Therefore your first approach is flawed resulting in a too good error estimate. Your second approach is correct. It performs worse because it most likely does not select the same five features for every one of the ten runs during cross-validation. hth Bernhard

Upvotes: 2
