Vito

Reputation: 426

ID3 binary tree or non-binary tree?

I'm trying to implement the ID3 or C4.5 algorithm.

According to ID3 algorithm, the information gain is calculated as follows:

IG(A) = H(D) - sum over each value v of attribute A of P(A==v)H(D|A==v)

For example: training data like this:

credit     age     label
normal     young   yes
normal     old     yes
bad        old     no
excellent  middle  yes

The IG of credit should look like this: IG(credit) = H(D) - P(credit==normal)H(D|credit==normal) - P(credit==bad)H(D|credit==bad) - P(credit==excellent)H(D|credit==excellent)
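A minimal Python sketch of this multiway calculation on the four training rows above. The names `rows`, `entropy` and `multiway_gain` are my own for illustration, not from any library, and the last column is assumed to be the label.

```python
from collections import Counter
from math import log2

# The four training rows from above: (credit, age, label).
rows = [
    ("normal", "young", "yes"),
    ("normal", "old", "yes"),
    ("bad", "old", "no"),
    ("excellent", "middle", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def multiway_gain(rows, col):
    """IG of splitting on every distinct value of column `col` at once."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    # Weighted entropy of the child nodes, one child per attribute value.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

print(round(multiway_gain(rows, 0), 4))  # IG(credit)
print(round(multiway_gain(rows, 1), 4))  # IG(age)
```

On this tiny data set every credit value leads to a pure subset, so IG(credit) equals H(D), about 0.8113 bits.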

When I choose credit as the best feature to split on, I do not consider the attribute "credit" again in the subsequent steps.

However, I have also seen someone implement it like this: IG(credit==normal) = H(D) - P(credit==normal)H(D|credit==normal) - P(credit ~= normal)H(D|credit ~= normal)

When I choose credit == normal as the best split, I still consider the attribute "credit" again in the subsequent steps, for example credit == "bad".
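A minimal Python sketch of the binary variant on the same four rows, scoring each "credit == value" test separately. The names `rows`, `entropy` and `binary_gain` are my own for illustration, and the last column is assumed to be the label.

```python
from collections import Counter
from math import log2

# The four training rows from above: (credit, age, label).
rows = [
    ("normal", "young", "yes"),
    ("normal", "old", "yes"),
    ("bad", "old", "no"),
    ("excellent", "middle", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def binary_gain(rows, col, value):
    """IG of the two-way test `row[col] == value` versus `row[col] != value`."""
    labels = [r[-1] for r in rows]
    left = [r[-1] for r in rows if r[col] == value]
    right = [r[-1] for r in rows if r[col] != value]
    # Weighted entropy of the two children; skip an empty side.
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - remainder

for v in ("normal", "bad", "excellent"):
    print(v, round(binary_gain(rows, 0, v), 4))
```

On these rows the three binary tests score differently (about 0.3113 for normal, 0.8113 for bad and 0.1226 for excellent), so the binary procedure has to compare attribute/value pairs rather than whole attributes.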

The two IG calculation procedures produce different trees: the first gives a non-binary (multiway) tree, the second a binary tree.

My question is whether the two trees are equivalent. When I test both trees, will the results always be the same? Or is one better than the other? Or is it hard to say which is better because it depends on the data?

Upvotes: 0

Views: 1122

Answers (1)

data_person

Reputation: 4490

As you have mentioned, one tree performs multiway splits and the other performs binary splits. The two trees are definitely NOT equivalent, so the test results will not always be the same, although the accuracy in both cases could be in a similar range. As for your last two questions, which model is better depends on your data.

Upvotes: 0
