I am new to Weka. I have two-class data to classify, and I can already classify it using the usual weighting schemes (word occurrences, TF-IDF, or word presence). I wanted to improve the accuracy of the classifier using the feature selection mechanism integrated in Weka, as follows:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

BufferedReader trainReader = new BufferedReader(new FileReader(dataSource));
Instances trainInsts = new Instances(trainReader);
trainInsts.setClassIndex(trainInsts.numAttributes() - 1);

// convert the data from string to numeric word counts so InfoGain can score the words
StringToWordVector STWfilter = new StringToWordVector();
STWfilter.setOutputWordCounts(true);
STWfilter.setInputFormat(trainInsts);
trainInsts = Filter.useFilter(trainInsts, STWfilter);

// rank the word features by Information Gain and keep the top n
int n = 400; // number of features to select
AttributeSelection attributeSelection = new AttributeSelection();
Ranker ranker = new Ranker();
ranker.setNumToSelect(n);
InfoGainAttributeEval infoGainAttributeEval = new InfoGainAttributeEval();
attributeSelection.setEvaluator(infoGainAttributeEval);
attributeSelection.setSearch(ranker);
attributeSelection.setInputFormat(trainInsts);
trainInsts = Filter.useFilter(trainInsts, attributeSelection);

// the data is already vectorized, so cross-validate J48 (FilteredClassifier's default learner) directly
J48 model = new J48();
int folds = 10; // 'folds' was undeclared in my snippet; 10 is what I use
Evaluation eval = new Evaluation(trainInsts);
eval.crossValidateModel(model, trainInsts, folds, new Random(1));
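For reference, the weightings I mention above are just options of the same StringToWordVector filter; this is how I switch between them:

STWfilter.setOutputWordCounts(true);     // word occurrences (raw counts)
// STWfilter.setTFTransform(true);       // add these two on top of the
// STWfilter.setIDFTransform(true);      // counts for TFIDF weighting
// STWfilter.setOutputWordCounts(false); // or 0/1 word presence instead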
This works, and I see slight improvements over the standard weighting methods such as word occurrence. However, I am not sure that what I did is correct, because the feature selection feels like just another weighting method. Also, must I choose the number n of features to select myself? It influences the result of the classifier significantly; how should it be set? For example, when I have 3000 instances, how many features should I select? And is there any way in Weka to obtain the number of features (words) present in my data? For example, with 2000 instances the best accuracy was with n = 400.
Any comments?
Thanks in advance
Answering your questions one by one:

Rather than fixing the number n of features to select, you can set the Ranker's threshold to 0.0. This means that all features scoring more than 0.0 will be kept, as they provide at least a bit of predictive information. You can raise that threshold up to 1.0 in the case of Information Gain; the higher the threshold, the fewer features you will keep.

Additionally, a rule of thumb that has been used in the text classification literature (see e.g. the Yang & Pedersen paper) is keeping around 1-10% of the features. In Information Retrieval, Salton stated that the terms with a Document Frequency between 1 and 10% of the number of documents were the most discriminant (but Information Retrieval is about search, which is not supervised).

So, summarizing: you are doing it right. Keep on with attribute selection, but for simplicity, set 0.0 as the minimum threshold for Information Gain.
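For instance, here is a minimal sketch of that setup against the code in your question (only the Ranker lines change; in Weka's Ranker a negative numToSelect means "keep everything that passes the threshold", and the 5% below is just one illustrative point inside the 1-10% rule of thumb):

ranker.setThreshold(0.0);  // keep every word with InfoGain > 0.0
ranker.setNumToSelect(-1); // no fixed n: the threshold decides

// the number of word features in your data is simply the attribute count
// of the vectorized Instances, minus one for the class attribute
int numWords = trainInsts.numAttributes() - 1;
int n = (int) Math.round(0.05 * numWords); // e.g. keep ~5% of the features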