Reputation: 33735
This is a rather newbie question, so please take it with a grain of salt.
I'm new to the field of data mining and am trying to get my head around the topic. Right now I'm trying to polish my existing model so that it classifies instances better. The problem is that my model has around 480 attributes. I know for sure that not all of them are relevant, but it's hard for me to point out which ones are actually important.
The question is: given valid training and test sets, can one use some sort of data mining algorithm that throws away attributes which seem to have no impact on the quality of classification?
I'm using Weka.
Upvotes: 4
Views: 4126
Reputation: 449
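Look into the InfoGainAttributeEval class. Its buildEvaluator() method prepares the evaluator on your training data, and evaluateAttribute(int index) then returns the information gain score of the attribute at that index, so you can rank the attributes and drop the ones that score close to zero.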
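To make it clearer what an information gain score means, here is a minimal sketch in plain Java that computes it for a nominal attribute. This is not Weka's code and it does not call Weka; the class name InfoGainSketch, the method names and the toy data are all made up for illustration, and it assumes nominal attributes with no missing values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Weka: information gain of a nominal attribute.
class InfoGainSketch {

    // Shannon entropy (in bits) of a distribution given as counts.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Gain = H(class) - sum over values v of P(attr = v) * H(class | attr = v).
    static double infoGain(String[] attributeValues, String[] classLabels) {
        int n = classLabels.length;
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Integer> valueCounts = new HashMap<>();
        Map<String, Map<String, Integer>> classCountsByValue = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(classLabels[i], 1, Integer::sum);
            valueCounts.merge(attributeValues[i], 1, Integer::sum);
            classCountsByValue
                .computeIfAbsent(attributeValues[i], k -> new HashMap<>())
                .merge(classLabels[i], 1, Integer::sum);
        }
        double conditional = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : classCountsByValue.entrySet()) {
            int size = valueCounts.get(e.getKey());
            conditional += ((double) size / n) * entropy(e.getValue(), size);
        }
        return entropy(classCounts, n) - conditional;
    }

    public static void main(String[] args) {
        // Toy data (invented): one attribute that tracks the class, one that does not.
        String[] labels  = {"yes", "yes", "no", "no"};
        String[] useful  = {"a", "a", "b", "b"};
        String[] useless = {"x", "y", "x", "y"};
        System.out.println("useful:  " + infoGain(useful, labels));
        System.out.println("useless: " + infoGain(useless, labels));
    }
}
```

An attribute whose score is close to zero tells you almost nothing about the class on its own, which makes it a candidate for removal. Keep in mind that this scores each attribute separately, so it will not notice attributes that are only useful in combination.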
Upvotes: 0
Reputation: 439
Comment converted to answer as OP suggested: If you use Weka 3.6.6, open the Explorer module, then go to the "Select attributes" tab and choose an "Attribute evaluator" and a "Search method". You can also choose between using the full data set or cross-validation sets. For more details see e.g. http://forums.pentaho.com/showthread.php?68687-Selecting-Attributes-with-Weka or http://weka.wikispaces.com/Performing+attribute+selection
Upvotes: 0
Reputation: 240
You should test with several of the classifier algorithms that Weka provides.
The basic idea is to use the Cross-validation option, so you can see which algorithm gives you the best Correctly Classified Instances value.
Here is an example from one of my training sets, using the Cross-validation option with Folds set to 10.
Using the J48 classifier I get:
Correctly Classified Instances 4310 83.2207 %
Incorrectly Classified Instances 869 16.7793 %
and if I use, for example, the NaiveBayes algorithm I get:
Correctly Classified Instances 1996 38.5403 %
Incorrectly Classified Instances 3183 61.4597 %
and so on; the values differ depending on the algorithm.
So, test as many algorithms as possible and see which one gives you the best trade-off between Correctly Classified Instances and time consumed.
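In case it helps to see where the percentages come from, here is a small sketch in plain Java that turns the correct/incorrect counts quoted above into the Correctly Classified Instances percentage and picks the better of the two classifiers. The class and method names (AccuracySketch, percentCorrect, best) are hypothetical and nothing here calls Weka; it only redoes the arithmetic on the numbers from my output.

```java
// Hypothetical helper, not part of Weka: accuracy arithmetic on summary counts.
class AccuracySketch {

    // Percentage of correctly classified instances out of all evaluated instances.
    static double percentCorrect(int correct, int incorrect) {
        return 100.0 * correct / (correct + incorrect);
    }

    // Returns the name with the highest percentage; names[i] goes with percents[i].
    static String best(String[] names, double[] percents) {
        int bestIndex = 0;
        for (int i = 1; i < percents.length; i++) {
            if (percents[i] > percents[bestIndex]) bestIndex = i;
        }
        return names[bestIndex];
    }

    public static void main(String[] args) {
        // Counts taken from the 10-fold cross-validation output quoted above.
        String[] names = {"J48", "NaiveBayes"};
        double[] percents = {
            percentCorrect(4310, 869),
            percentCorrect(1996, 3183)
        };
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: %.4f %%%n", names[i], percents[i]);
        }
        System.out.println("Best: " + best(names, percents));
    }
}
```

Both runs were evaluated on the same 5179 instances (4310 + 869 = 1996 + 3183), so the two percentages are directly comparable. If training time matters to you, record it next to each percentage and weigh the two yourself.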
Upvotes: 2
Reputation: 506
Read up on clustering algorithms (but run them only on your training set!).
Upvotes: 0