Reputation: 40909
I used Weka to successfully build a J48 (C4.5) decision tree. I would now like to evaluate how effective or important my features are.
One obvious way is to loop over the features, remove one at a time, and re-run the classification tests each time to see which removal causes the largest drop in accuracy. However, this may hide co-dependencies between features.
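In code, that loop might look something like the following (a minimal sketch against Weka's Java API; "mydata.arff", the last-attribute class index, and the 10-fold setup are just placeholders, not taken from my actual experiment):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class AblationTest {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Baseline: 10-fold cross-validated accuracy on all features.
        System.out.printf("all features: %.2f%%%n", accuracy(data));

        // Drop one attribute at a time (skipping the class) and re-evaluate.
        for (int i = 0; i < data.numAttributes(); i++) {
            if (i == data.classIndex()) continue;
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(i + 1)); // Remove is 1-based
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            System.out.printf("without %s: %.2f%%%n",
                    data.attribute(i).name(), accuracy(reduced));
        }
    }

    static double accuracy(Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        return eval.pctCorrect();
    }
}
```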
Instead, I am thinking of another approach based on how the C4.5 algorithm works. Since each split in the tree is chosen to maximize information gain, a split on a feature close to the root means that feature had more information gain (on that node's subset of the data) than the features chosen for splits lower down. So for a given feature F that occurs in several splits within the tree, I can calculate F's average distance from the root, then rank all the features by that average, with the lowest average marking the most valuable feature. Would this be a correct approach?
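To make the idea concrete, here is a rough sketch of how I might compute those average depths by parsing J48's textual output (this assumes the default output format, where each level of depth adds a "|   " prefix; note it counts every branch line of a split rather than each split node once):

```java
import java.util.HashMap;
import java.util.Map;

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AverageSplitDepth {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);

        // Map each attribute name to {sum of depths, number of occurrences}.
        Map<String, int[]> stats = new HashMap<>();
        for (String line : tree.toString().split("\n")) {
            int depth = 0;
            while (line.startsWith("|   ")) { // one "|   " prefix per tree level
                depth++;
                line = line.substring(4);
            }
            // A test line looks like "width <= 0.6: classA (50.0)" or "color = red".
            int cut = line.indexOf(" <= ");
            if (cut < 0) cut = line.indexOf(" > ");
            if (cut < 0) cut = line.indexOf(" = ");
            if (cut < 0) continue; // header, footer, and blank lines carry no test
            int[] s = stats.computeIfAbsent(line.substring(0, cut), k -> new int[2]);
            s[0] += depth;
            s[1]++;
        }
        stats.forEach((attr, s) -> System.out.printf(
                "%s: average depth %.2f over %d branch lines%n",
                attr, (double) s[0] / s[1], s[1]));
    }
}
```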
Upvotes: 3
Views: 9312
Reputation: 71
Bit of a necro post... but here goes...
I'm assuming the reason you want to know the importance of the attributes is so you can build a better tree by only using the relevant attributes.
If that is the case, you could always use the meta classifier "AttributeSelectedClassifier" and set J48 as its classifier.
You must then choose an evaluator for attribute subsets and a search method. For instance, I'm currently experimenting with the "WrapperSubsetEval" evaluator and the "GeneticSearch" search algorithm.
For the wrapper evaluation you need to choose a classifier (it actually builds that classifier to see how well it does on each attribute subset the search tries); in my case that's J48 again, matching the classifier I want to use the attribute set with.
With these settings, it will evolve a subset of attributes (with a genetic algorithm) that works well with J48, then run J48 on your data with that evolved attribute set.
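If you'd rather script it than click through the GUI, a minimal sketch with Weka's Java API might look like this (the file name is a placeholder, and in recent Weka versions GeneticSearch may need to be installed as a separate package rather than shipping in the core distribution):

```java
import java.util.Random;

import weka.attributeSelection.GeneticSearch;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // The wrapper scores each candidate subset by building a J48 on it.
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48());

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(wrapper);
        asc.setSearch(new GeneticSearch());
        asc.setClassifier(new J48()); // final model, trained on the evolved subset

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```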
This is computationally expensive, as it must build and test many trees, but it can give good results (and it's much faster than trying to do it by hand) :)
Upvotes: 4
Reputation: 6087
You could try the "Select attributes" tab. There you can run a PCA, a CfsSubsetEval + BestFirst search, and so on, in order to determine which features are best.
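The same thing can be done programmatically; here is a minimal CfsSubsetEval + BestFirst sketch with Weka's Java API (the dataset path is a placeholder):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CfsDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);

        // selectedAttributes() returns 0-based indices, including the class.
        for (int i : selector.selectedAttributes()) {
            System.out.println(data.attribute(i).name());
        }
    }
}
```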
Another (but manual) way is to train and test the same algorithm with different attribute sets and compare the results statistically using a t-test, in order to determine whether the improvement is statistically significant.
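Weka's Experimenter will run that comparison for you (it uses a corrected resampled t-test); if you want to do it by hand, a plain paired t-test over per-fold accuracies looks roughly like this (the accuracy arrays are made-up numbers purely for illustration):

```java
public class PairedTTest {
    // Standard paired t statistic over per-fold accuracy differences.
    // Note: for cross-validation results Weka's Experimenter applies a
    // "corrected resampled" variance adjustment; this plain version
    // tends to be optimistic because the folds are not independent.
    static double pairedT(double[] a, double[] b) {
        int n = a.length;
        double mean = 0;
        for (int i = 0; i < n; i++) mean += a[i] - b[i];
        mean /= n;
        double var = 0;
        for (int i = 0; i < n; i++) {
            double d = (a[i] - b[i]) - mean;
            var += d * d;
        }
        var /= (n - 1);
        return mean / Math.sqrt(var / n); // compare to t with n-1 degrees of freedom
    }

    public static void main(String[] args) {
        // Hypothetical per-fold accuracies for two attribute sets.
        double[] full    = {94.0, 92.5, 95.1, 93.0, 94.4, 92.0, 95.5, 93.3, 94.1, 92.8};
        double[] reduced = {93.5, 92.0, 94.8, 93.2, 94.0, 91.5, 95.0, 93.0, 93.8, 92.5};
        System.out.printf("t = %.3f (9 degrees of freedom)%n", pairedT(full, reduced));
    }
}
```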
Upvotes: 2