ŁukaszBachman

Reputation: 33735

Extract important attributes in Weka

This is a rather newbie question, so please take it with a grain of salt.

I'm new to the field of data mining and trying to wrap my head around this topic. Right now I'm trying to polish my existing model so that it classifies instances better. The problem is that my model has around 480 attributes. I know for sure that not all of them are relevant, but it's hard for me to point out which ones are actually important.

The question is: given valid training and test sets, can one use some sort of data mining algorithm that throws away attributes which seem to have no impact on the quality of classification?

I'm using Weka.

Upvotes: 4

Views: 4126

Answers (4)

roopalgarg

Reputation: 449

Look into the InfoGainAttributeEval class. The buildEvaluator() and the evaluateAttribute(int index) functions should help.
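
To make the idea concrete: information gain scores an attribute by how much knowing its value reduces the entropy of the class. Below is a minimal plain-Java sketch of that score for nominal attributes. It is not Weka's implementation; the class name InfoGainSketch, the method names, and the toy data are all made up for illustration. Attributes whose score comes out near zero are candidates for removal.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical helper, not part of Weka.
class InfoGainSketch {

    // Shannon entropy (in bits) of a label distribution given as counts.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of one nominal attribute with respect to the class:
    // H(class) - sum over values v of P(attr = v) * H(class | attr = v).
    static double infoGain(String[] attrValues, String[] labels) {
        int n = labels.length;
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Map<String, Integer>> classCountsPerValue = new HashMap<>();
        Map<String, Integer> valueTotals = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(labels[i], 1, Integer::sum);
            classCountsPerValue
                    .computeIfAbsent(attrValues[i], k -> new HashMap<>())
                    .merge(labels[i], 1, Integer::sum);
            valueTotals.merge(attrValues[i], 1, Integer::sum);
        }
        double gain = entropy(classCounts, n);
        for (Map.Entry<String, Map<String, Integer>> e : classCountsPerValue.entrySet()) {
            int size = valueTotals.get(e.getKey());
            gain -= ((double) size / n) * entropy(e.getValue(), size);
        }
        return gain;
    }

    public static void main(String[] args) {
        // Toy data (assumed): the first attribute determines the class,
        // the second is independent of it.
        String[] labels = {"yes", "yes", "no", "no"};
        String[] informative = {"a", "a", "b", "b"};
        String[] irrelevant = {"x", "y", "x", "y"};
        System.out.println("informative: " + infoGain(informative, labels));
        System.out.println("irrelevant:  " + infoGain(irrelevant, labels));
    }
}
```

With 480 attributes you would compute such a score per attribute, sort, and inspect the low end of the ranking before dropping anything.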

Upvotes: 0

java_xof

Reputation: 439

Comment converted to answer as OP suggested: If you use Weka 3.6.6, open the Explorer module, then go to the "Select attributes" tab and choose an "Attribute evaluator" and a "Search method". You can also choose between using the full data set or cross-validation sets. For more details see e.g. http://forums.pentaho.com/showthread.php?68687-Selecting-Attributes-with-Weka or http://weka.wikispaces.com/Performing+attribute+selection

Upvotes: 0

Mihai M.

Reputation: 240

You should test using some of the Classifier algorithms that Weka has.

The basic idea is to use the Cross-validation option, so you can see which algorithm gives you the best Correctly Classified Instances value.

I can give you an example from one of my training sets, using the Cross-validation option with Folds set to 10.

Using the J48 classifier I get:

Correctly Classified Instances        4310               83.2207 %
Incorrectly Classified Instances       869               16.7793 %

and if I use, for example, the NaiveBayes algorithm I get:

Correctly Classified Instances        1996               38.5403 %
Incorrectly Classified Instances      3183               61.4597 %

and so on; the values differ depending on the algorithm.

So, test as many algorithms as possible and see which one gives you the best trade-off between Correctly Classified Instances and time consumed.
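
For anyone wondering where the Correctly Classified Instances figure comes from, here is a rough plain-Java sketch of k-fold cross-validation. It is only an illustration under stated assumptions: the class CrossValidationSketch and its methods are hypothetical, the "classifier" is a trivial majority-class baseline standing in for a real learner such as J48, and the folds are not stratified, so it will not reproduce Weka's numbers exactly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative only: a hypothetical helper, not part of Weka.
class CrossValidationSketch {

    // Stand-in for a real learner: always predicts the most frequent
    // label seen in the training portion.
    static String trainMajority(List<String> trainLabels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = -1;
        for (String label : trainLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = label;
            }
        }
        return best;
    }

    // Returns the number of correctly classified instances summed over
    // k folds; every instance is tested exactly once.
    static int crossValidate(List<String> labels, int k, long seed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < labels.size(); i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));

        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<String> train = new ArrayList<>();
            List<String> test = new ArrayList<>();
            for (int pos = 0; pos < order.size(); pos++) {
                String label = labels.get(order.get(pos));
                if (pos % k == fold) {
                    test.add(label);   // held out for this fold
                } else {
                    train.add(label);  // used to fit the model
                }
            }
            String predicted = trainMajority(train);
            for (String actual : test) {
                if (actual.equals(predicted)) {
                    correct++;
                }
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Toy labels (assumed): 8 instances of "a" and 2 of "b".
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < 8; i++) labels.add("a");
        for (int i = 0; i < 2; i++) labels.add("b");

        int correct = crossValidate(labels, 5, 42L);
        double percent = 100.0 * correct / labels.size();
        System.out.println("Correctly Classified Instances: " + correct + " (" + percent + " %)");
    }
}
```

The percentage is simply correct / total, which is how 4310 out of 5179 instances becomes 83.2207 % in the J48 output above.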

Upvotes: 2

mcassano

Reputation: 506

Read up on the topic of clustering algorithms (only on your training set though!)

Upvotes: 0
