polkduran

Reputation: 2551

Decision tree completeness and unclassified data

I made a program that trains a decision tree based on the ID3 algorithm, using an information gain function (Shannon entropy) as the split criterion for feature selection. Once I had trained a decision tree, I tested it on unseen data and realized that some instances cannot be classified: there is no path in the tree that matches them.

An example (this is a toy illustration, but I encounter the same problem with a larger and more complex data set):

The instance ("a1", "b3") cannot be classified with the given tree. Several questions came to mind:

  1. Does this situation have a name? Tree incompleteness, or something like that?
  2. Is there a way to know whether a decision tree will cover all possible unseen instances (all combinations of feature values)?
  3. Does the reason for this "incompleteness" lie in the structure of the data set or in the algorithm used to train the decision tree (ID3 in this case), or in something else?
  4. Is there a way to classify these unclassifiable instances with the given decision tree, or must one use another tool (random forests, neural networks, ...)?

Upvotes: 1

Views: 1580

Answers (1)

blazs

Reputation: 4845

This situation cannot occur with the ID3 decision-tree learner---regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)

The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.

This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it will create d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to attribute values a1, a2, and a3.
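
To make that concrete, here is a minimal sketch of the split step in Python (this is not your software's code; the function and variable names are mine, and edge cases are handled in the simplest possible way). The key point is the loop that creates one child per possible value of the chosen attribute, taking the possible values from the attribute specification rather than from the training subset:

    import math
    from collections import Counter

    def entropy(examples):
        """Shannon entropy of the class labels in `examples`, a list of (features, label) pairs."""
        counts = Counter(label for _, label in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attr, values):
        """Information gain of splitting `examples` on attribute `attr`."""
        total = len(examples)
        remainder = 0.0
        for v in values[attr]:
            subset = [e for e in examples if e[0][attr] == v]
            if subset:
                remainder += (len(subset) / total) * entropy(subset)
        return entropy(examples) - remainder

    def id3(examples, attributes, values):
        """Minimal ID3 sketch.
        examples   -- list of (features_dict, label) pairs
        attributes -- attributes still available for splitting
        values     -- dict mapping each attribute to ALL its possible values (from the data specification)
        Returns either a class label (a leaf) or a pair (attribute, {value: subtree}).
        """
        labels = {label for _, label in examples}
        if len(labels) == 1:
            return labels.pop()                      # pure node: return its class
        majority = Counter(l for _, l in examples).most_common(1)[0][0]
        if not attributes:
            return majority                          # no attributes left: majority-class leaf
        best = max(attributes, key=lambda a: information_gain(examples, a, values))
        children = {}
        for v in values[best]:                       # one child PER possible value of `best`
            subset = [e for e in examples if e[0][best] == v]
            if subset:
                children[v] = id3(subset, [a for a in attributes if a != best], values)
            else:
                children[v] = majority               # empty branch: leaf with the parent's majority class
        return (best, children)

Because the loop iterates over values[best] (every declared value), the branch for a1 exists even if no training example with f1 = a1 reaches that node; such a branch simply becomes a majority-class leaf.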

It follows from this split rule (and, in general, from the way the ID3 algorithm works) that any well-formed vector---of the form (v1, v2, ..., vn, y), where vi is the value of the i-th attribute and y is the class value---is classifiable by the decision tree that the algorithm learns on a given training set.
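
To illustrate, classification with a tree in the representation used by the sketch above is just a loop that follows the branch matching the instance's value; since every possible value has a branch, the lookup cannot fail for a well-formed instance:

    def classify(tree, instance):
        """Classify `instance` (a dict mapping attribute -> value) with a tree
        returned by the id3() sketch above. Because the tree has one branch per
        possible value of each split attribute, the dictionary lookup always
        succeeds for a well-formed instance."""
        while isinstance(tree, tuple):
            attribute, children = tree
            tree = children[instance[attribute]]
        return tree

    # e.g. classify(trained_tree, {"f1": "a1", "f2": "b3"}) always yields a class label
    # (the attribute names f1/f2 and the values are taken from the question's example).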

Would you mind providing a link to the software you used to learn the "incomplete" trees?

To answer your questions:

  1. Not that I know of. It doesn't make sense to learn such "incomplete" trees: if we knew that some attribute values would never occur, we would not include them in the specification (the file where you list the attributes and their values) in the first place.

  2. With the ID3 algorithm you can prove---as I sketched above---that every tree returned by the algorithm covers all possible combinations of attribute values.

  3. The cause lies in the algorithm (or the implementation of it) that you are using; the data has nothing to do with it.

  4. There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines the decision-tree learning problem as follows: given a training set S of examples x_1, x_2, ..., x_m, each of the form x_i = (v_i^1, v_i^2, ..., v_i^n, y_i), where v_i^j is the value of the j-th attribute and y_i is the class value of example x_i, learn a function f: X -> Y (represented by a decision tree) that minimizes an error function, e.g. the number of misclassified examples (see the short sketch after this list). Here X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values. From this definition you can see that f is required to map any combination of attribute values to a class value; thus, by definition, every possible instance is classifiable.
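
For completeness, the error function mentioned in point 4 (the number of misclassified training examples) could be written as follows, reusing the classify() sketch above; again, this is only an illustration of the definition, not anyone's actual implementation:

    def misclassification_error(tree, examples):
        """Number of (features_dict, label) pairs that the tree classifies incorrectly."""
        return sum(1 for features, label in examples if classify(tree, features) != label)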

Upvotes: 2
