Clint Whaley

Reputation: 479

finding best attribute for decision tree

Why do we want the “best” attribute to be selected? Would there be any difference if we selected some other attribute instead?

Upvotes: 4

Views: 4392

Answers (3)

mishless

Reputation: 169

First, let's clarify what the "best" attribute means in the context of decision trees - it is the attribute that "best" classifies the available training examples. There are two terms one needs to be familiar with in order to define "best" - entropy and information gain. Entropy is a term from information theory - it is a number that represents how heterogeneous a set of examples is with respect to their target class. Another perspective on entropy - it is the number of bits needed to encode the class of a random example from the set. Information gain, on the other hand, shows how much the entropy of a set of examples will decrease if a specific attribute is chosen for the split. The alternative perspective - it shows the decrease in the number of bits needed to represent the class of a random example if that attribute is chosen.
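
To make these two definitions concrete, here is a minimal sketch in Python (the function names and the representation of examples as dicts are my own choices for illustration, not from any particular library):

import math
from collections import Counter

def entropy(labels):
    # Entropy, in bits, of a list of class labels.
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    # Decrease in entropy of the target when the examples are split on attribute.
    # `examples` is a list of dicts mapping attribute names to values.
    base = entropy([e[target] for e in examples])
    total = len(examples)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder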

So, why do we choose the "best" attribute based on the training examples? The simple answer is that this is how the algorithm for constructing a decision tree works - it searches through possible decision trees and chooses the first one that classifies the training examples correctly, employing a simple-to-complex search. Since the basic implementation doesn't include any mechanism for revisiting and changing earlier decisions, it makes sense to use a greedy approach instead of a random one.
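
The greedy, top-down construction itself fits in a few lines. The sketch below assumes the entropy/information_gain helpers from the previous snippet are in scope; it is illustrative rather than a complete ID3 implementation:

def most_common_label(examples, target):
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def build_tree(examples, attributes, target):
    # Pick the attribute with the highest information gain, split on it,
    # and recurse on each subset. No backtracking: earlier choices are final.
    labels = set(e[target] for e in examples)
    if len(labels) == 1:              # pure node - nothing left to split
        return labels.pop()
    if not attributes:                # no attributes left - majority vote
        return most_common_label(examples, target)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree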

Here is a simple example to demonstrate the consequences of not choosing the "best" attribute when constructing a decision tree. Let's assume we have the following training examples with attributes Exam, Friends and Weather and target Activity. These examples describe the preferred activity based on whether there is an exam soon, whether friends are available and whether the weather is sunny or rainy.

╔══════╦═════════╦═════════╦══════════╗
║ Exam ║ Friends ║ Weather ║ Activity ║
╠══════╬═════════╬═════════╬══════════╣
║ yes  ║ yes     ║ sunny   ║ study    ║
║ no   ║ yes     ║ sunny   ║ picnic   ║
║ yes  ║ no      ║ rain    ║ study    ║
║ yes  ║ yes     ║ rain    ║ study    ║
║ no   ║ yes     ║ rain    ║ play     ║
║ no   ║ no      ║ rain    ║ play     ║
╚══════╩═════════╩═════════╩══════════╝

When we do the math we end up with the following numbers for information gain:

IG(D, Exam) ~ 1
IG(D, Friends) ~ 0.13
IG(D, Weather) ~ 0.46

The "best" attribute to choose for a root of the decision tree is Exam. The next step is to decide which attribute to choose ti inspect when there is an exam soon and when there isn't. When there is an exam soon the activity is always study, so there is not need for further exploration. When there is not an exam soon, we need to calculate the information gain for choosing Friends and Weather:

IG(D-No-Exam, Friends) ~ 0.25
IG(D-No-Exam, Weather) ~ 0.92
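
For reference, all of these numbers can be reproduced with the entropy/information_gain helpers sketched earlier; the dicts below simply mirror the table:

data = [
    {"Exam": "yes", "Friends": "yes", "Weather": "sunny", "Activity": "study"},
    {"Exam": "no",  "Friends": "yes", "Weather": "sunny", "Activity": "picnic"},
    {"Exam": "yes", "Friends": "no",  "Weather": "rain",  "Activity": "study"},
    {"Exam": "yes", "Friends": "yes", "Weather": "rain",  "Activity": "study"},
    {"Exam": "no",  "Friends": "yes", "Weather": "rain",  "Activity": "play"},
    {"Exam": "no",  "Friends": "no",  "Weather": "rain",  "Activity": "play"},
]

for attr in ("Exam", "Friends", "Weather"):
    print(attr, round(information_gain(data, attr, "Activity"), 2))
# Exam 1.0, Friends 0.13, Weather 0.46

no_exam = [e for e in data if e["Exam"] == "no"]
for attr in ("Friends", "Weather"):
    print(attr, round(information_gain(no_exam, attr, "Activity"), 2))
# Friends 0.25, Weather 0.92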

Following the same strategy as before, we would choose Weather, and the final decision tree will look like this:

      Exam?
      /  \
    yes   no
    /      \
 STUDY     Weather?
            /   \
         sunny  rain 
          /       \
       PICNIC     PLAY

Now let's construct a decision tree that classifies the examples but uses a different root - Friends - and chooses attributes randomly where needed. We can end up with the following tree:

                Friends?
                /     \
              yes      no
             /          \
          Exam?         Exam?
          /  \           /   \
        yes   no       yes   no
        /      \        |     |
     STUDY   Weather?  STUDY  PLAY
               /   \
            sunny  rain
             /       \
          PICNIC    PLAY

Both trees classify the training examples correctly. The difference is that the second tree is more complex and may overfit the training data. The algorithm that constructs decision trees by always choosing the best attribute "prefers" trees that are shorter and less complex, and trees that examine the best attributes first.

Upvotes: 8

user3085931

Reputation: 1803

As always, it depends on the case. But mostly you want to execute the tree as fast as possible, so you want to avoid unnecessary decisions/branches. So you choose the best feature (and the corresponding split position), while other choices might need e.g. 2-3 more branches to finally predict a sample.

It's easy to see that processing a tree with 10 branches will in general be a lot quicker than another one with 30 branches.

Upvotes: 0

Nate

Reputation: 2492

A common way to determine which attribute to choose in decision trees is information gain. Basically, you try each attribute and see which one splits your data best. Check out page 6 of this deck: http://homes.cs.washington.edu/~shapiro/EE596/notes/InfoGain.pdf
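
If you would rather let a library do this for you, scikit-learn's DecisionTreeClassifier picks the split with the best entropy reduction when criterion="entropy" is set. A minimal sketch (the tiny numeric dataset here is made up, and real categorical attributes would need to be encoded as numbers first):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 1], [1, 1], [0, 0], [1, 0]]   # two toy features
y = [0, 1, 0, 1]                       # binary target

clf = DecisionTreeClassifier(criterion="entropy")  # split quality = information gain
clf.fit(X, y)
print(export_text(clf))  # shows which feature was chosen at each split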

Upvotes: 0
