Reputation: 31109
The examples I found fall into two types.
Data with only two classes of items, for example only blue and yellow balls. That is, there is only one feature, in this case color. This is a clear example of the "divide and conquer" rule as applied to entropy. But it is senseless for any prediction or classification problem: if an object has only one feature and its value is known, we don't need a tree to decide that "this ball is yellow".
Data with multiple features and a feature to predict (known for the training data). We can choose a predicate based on the minimum average entropy for each feature. Closer to life, isn't it? It was clear to me until I tried to implement the algorithm.
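To make the second case concrete, here is a minimal sketch of what "minimum average entropy for each feature" could look like. It assumes one common reading: split the training rows by a known feature, measure the entropy of the feature to predict inside each group, and average those entropies weighted by group size. The names `entropy`, `average_entropy`, and the toy `rows` are hypothetical, made up for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def average_entropy(rows, feature, target):
    """Weighted average entropy of `target` after grouping rows by `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * entropy(g) for g in groups.values())


# Hypothetical training data: "label" is the feature to predict.
rows = [
    {"color": "yellow", "size": "small", "label": "keep"},
    {"color": "yellow", "size": "large", "label": "keep"},
    {"color": "blue", "size": "small", "label": "drop"},
    {"color": "blue", "size": "large", "label": "drop"},
]

# Score every known feature and pick the one with the lowest average entropy.
scores = {f: average_entropy(rows, f, "label") for f in ("color", "size")}
best = min(scores, key=scores.get)
```

In this toy data the label follows color exactly and is unrelated to size, so color gets the lower score. Note that the entropy is always measured on the values of the feature to predict; the known feature only decides how the rows are grouped.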
And now I have a conflict in my mind.
If we calculate entropy relative to the known features (one per node), the tree will give meaningful classification results only if the unknown feature strictly depends on every known feature. Otherwise a single unrelated known feature could break the whole prediction by driving the decision the wrong way. But if we calculate entropy relative to the values of the feature we want to predict, we are back to the first, senseless example. In that case it makes no difference which known feature is used for a node...
And a question about a tree building process.
Should I calculate entropy only for the known features and just trust that all of them are related to the unknown one? Or should I ALSO calculate entropy for the unknown feature (known for the training data) to determine which feature affects the result most?
Upvotes: 1
Views: 547
Reputation: 86
I had the same problem (in maybe a similar programming task) some years ago: do I calculate the entropy against the complete set of features, the relevant features for a branch, or the relevant features for a level?
It turned out like this: in a decision tree it comes down to comparing entropies between different branches to determine the optimal branch. A comparison requires equal base sets, i.e. whenever you want to compare two entropy values, they must be based on the same feature set.
For your problem you can go with the features relevant to the set of branches you want to compare, as long as you are aware that with this solution you cannot compare entropies between different branch sets. Otherwise go with the whole feature set.
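One way to read the "equal base sets" point, sketched below under assumptions: at each node, every candidate feature is scored on the same subset of training rows, and scores are only compared within that node. The sketch uses ID3-style information gain (entropy of the feature to predict before the split, minus the weighted average entropy after it), which is a common convention and not necessarily what I implemented back then. The names `gain`, `build`, and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def split(rows, feature):
    """Group rows by their value of `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row)
    return groups


def gain(rows, feature, target):
    """Information gain of `feature` with respect to `target` on these rows."""
    base = entropy([r[target] for r in rows])
    rest = sum(
        len(g) / len(rows) * entropy([r[target] for r in g])
        for g in split(rows, feature).values()
    )
    return base - rest


def build(rows, features, target):
    """Recursively build a tree as nested dicts; leaves are predicted labels."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # All candidates are compared on the SAME rows, so the scores are comparable.
    best = max(features, key=lambda f: gain(rows, f, target))
    remaining = [f for f in features if f != best]
    return {
        best: {
            value: build(group, remaining, target)
            for value, group in split(rows, best).items()
        }
    }


# Hypothetical data: "play" depends on "outlook" only; "windy" is unrelated.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
]
tree = build(rows, ["outlook", "windy"], "play")
```

Here the unrelated feature `windy` gets zero gain on the full row set, so it is never picked over `outlook`. Gains computed inside one branch are based on a different row subset than gains in a sibling branch, which is why they should not be compared with each other.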
(Disclaimer: the solution above is reconstructed from memory, from a problem that led to about an hour of thinking some years ago. Hopefully I got everything right.)
PS: Beware of the car dataset! ;)
Upvotes: 0