Reputation: 75
I'm confused about the numbers at the end of the branches of a J48 tree. For example, using the weather.nominal data, the tree looks the same whether the Test options are set to 'Use training set', 'Cross-validation', or 'Percentage split'. This is the output:
J48 pruned tree
------------------
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
According to the textbook by the authors of this software, in an example using this exact data they say, "In the tree structure, a colon introduces the class label that has been assigned to a particular leaf, followed by the number of instances that reach that leaf, expressed as a decimal number because of the way the algorithm uses fractional instances to handle missing values. If there were incorrectly classified instances (there aren’t in this example) their number would appear, too: thus 2.0/1.0 means that two instances reached that leaf, of which one is classified incorrectly."
So this means that no instances were incorrectly classified in the above tree with the weather.nominal dataset. On the other hand, when the test option is set to either 'Cross-validation' or 'Percentage split' (with the default random seed), there are incorrectly classified instances. For example, with a 60 percent split, it shows the following:
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 2 40 %
Incorrectly Classified Instances 3 60 %
There seems to be a contradiction here, but I must be missing something. Is the tree shown initially not the tree that is built with the 60 percent split? That is not stated anywhere as far as I have seen, but I can't think of any other explanation.
Just for completeness, the data is here:
outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Upvotes: 0
Views: 80
Reputation: 2608
If you take a closer look at the output, you will see the following:
=== Classifier model (full training set) ===
The model depicted there was trained on the full dataset, not on your split. Weka prints this full-data model whichever test option you choose, which is why the tree looks the same each time.
The next section has the following heading:
=== Evaluation on test split ===
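The statistics that you are referring to come from a separate model, trained on the training portion of your split and evaluated on the held-out remainder. That model is not printed, so its tree and leaf counts may differ from the one shown.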
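To check that the numbers at the leaves describe the full 14-instance dataset, you can route every row from the question through the printed tree and count what reaches each leaf. This is only a minimal sketch in plain Python, not Weka's code: the `leaf` function is a hand-written copy of the tree shown in the question, and the names `DATA`, `reached`, and `errors` are made up for illustration.

```python
from collections import Counter

# The 14 rows of weather.nominal, copied from the question
# (outlook, temperature, humidity, windy, play).
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".strip().splitlines()


def leaf(outlook, humidity, windy):
    """Hand-coded copy of the printed tree: returns (leaf name, predicted class)."""
    if outlook == "sunny":
        return ("sunny/humidity=" + humidity, "no" if humidity == "high" else "yes")
    if outlook == "overcast":
        return ("overcast", "yes")
    return ("rainy/windy=" + windy, "no" if windy == "TRUE" else "yes")


reached = Counter()  # instances that arrive at each leaf
errors = Counter()   # of those, how many the leaf's class label gets wrong

for row in DATA:
    outlook, _temperature, humidity, windy, play = row.split(",")
    name, predicted = leaf(outlook, humidity, windy)
    reached[name] += 1
    if predicted != play:
        errors[name] += 1

for name in reached:
    print(name, reached[name], errors[name])
```

The per-leaf counts come out as 3, 2, 4, 2, and 3 with no errors, matching the (3.0), (2.0), (4.0), (2.0), (3.0) in the printed tree. They add up to all 14 instances, so the printed tree cannot be the one built from only the training part of a percentage split.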
Upvotes: 0