Reputation: 117
I'm trying to create a decision tree with the C4.5 algorithm for a school project. The decision tree is for Haberman's Survival Data Set; the attribute information is as follows.
Attribute Information:
1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
We need to implement a decision tree where each leaf has exactly one distinct result (meaning the entropy of that leaf should be 0); however, there are six instances that have the same attributes but different results.
For example:
66,58,0,2
66,58,0,1
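Just to make the problem concrete (this is only an illustration, not part of my project code): the entropy of that two-instance subset can never reach 0, because the attributes are identical and there is nothing left to split on.

#include <stdio.h>
#include <math.h>

/* Entropy (in bits) of a two-class subset with n1 cases of class 1
   and n2 cases of class 2. */
double entropy(int n1, int n2)
{
    double total = n1 + n2, e = 0.0;
    if (n1 > 0) e -= (n1 / total) * log2(n1 / total);
    if (n2 > 0) e -= (n2 / total) * log2(n2 / total);
    return e;
}

int main(void)
{
    /* The two conflicting instances above: one of class 1, one of class 2 */
    printf("entropy = %.2f bits\n", entropy(1, 1)); /* prints 1.00, never 0 */
    return 0;
}

(Compile with -lm for log2.)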
What does the C4.5 algorithm do in this type of situation? I've searched everywhere but couldn't find any information.
Thanks.
Upvotes: 1
Views: 266
Reputation: 635
Read Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993. (It is a good resource for studying C4.5 if you have a college assignment.)
From what I studied, it seems that on page 137, in the source code listing of build.c, there is a comment:
/* if all cases are the same ... or there are not enough cases to divide */
(which is exactly your situation), and at that point it returns Node.
This Node comes from
Node = Leaf(ClassFreq, BestClass, Cases, Cases-NoBestClass);
ClassFreq stores the count of each class,
BestClass stores which class is dominant (most frequent),
Cases stores how many data points there are, and
NoBestClass stores how many data points belong to BestClass.
This Leaf function comes from the file trees.c.
The Leaf function returns a leaf node labelled with BestClass (the best class becomes the leaf).
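So, as far as I understand, for conflicting cases like yours C4.5 simply stops splitting and labels the leaf with the majority class; the Cases-NoBestClass argument presumably counts the instances that leaf misclassifies. Here is a minimal sketch of that idea (my own simplified code, not Quinlan's actual listing; names like class_freq and majority_class_leaf are made up for illustration):

#include <stdio.h>

#define NCLASSES 2   /* Haberman has classes 1 and 2, stored at indices 0 and 1 */

/* Simplified sketch: when the data cannot be split any further, count the
   cases of each class, pick the most frequent one, and make that class the
   leaf's prediction. */
int majority_class_leaf(const int *labels, int cases)
{
    int class_freq[NCLASSES] = {0};
    int best_class = 0, i;

    for (i = 0; i < cases; i++)
        class_freq[labels[i] - 1]++;          /* class labels are 1-based */

    for (i = 1; i < NCLASSES; i++)
        if (class_freq[i] > class_freq[best_class])
            best_class = i;

    return best_class + 1;                    /* back to a 1-based class label */
}

int main(void)
{
    /* e.g. five conflicting instances of class 1 and one of class 2 */
    int labels[] = {1, 1, 2, 1, 1, 1};
    printf("leaf predicts class %d\n",
           majority_class_leaf(labels, 6));   /* prints 1 */
    return 0;
}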
All of this information is based on Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
Anyone who has knowledge of this, please comment if I got something wrong. Thanks.
Upvotes: 0