Reputation: 939
I have run Classification and Regression Trees (trees.REPTree) on the housing.arff data (with a 66% Percentage Split). This is the output:
REPTree
============
RM < 6.84
| LSTAT < 14.8
| | LSTAT < 9.75 : 25.15 (88/21.02) [47/55.38]
What do the values at the leaves (25.15, 88/21.02, etc.) mean?
Upvotes: 2
Views: 1292
Reputation: 2608
For completeness' sake, here is a copy of Eibe Frank's answer from the Weka mailing list (dated 2015/01/21):
Remember that REPTree splits the data into a growing set and a pruning set (unless you turn pruning off).
Let’s say you have
(A/B) [C/D]
The meaning of this expression depends on whether you are doing regression (your case) or classification.
Regression case: A is the number of instances from the growing set that reach the leaf, and B is the mean squared error of the leaf's prediction on those instances; C and D are the corresponding instance count and mean squared error for the pruning set.
Classification case: A is the total weight of instances from the growing set that reach the leaf, and B is the weight of the misclassified ones among them; C and D are the corresponding weights for the pruning set.
The error will normally be larger on the pruning set than on the growing set, as in your case.
Note that A, B, C, and D are calculated before backfitting, which is the last step in the REPTree algorithm that happens after growing and pruning the tree. During backfitting, the data from the pruning set is used to update the predictions made at the leaf nodes, so that these are based on the full, combined data.
The predictions at the leaf nodes shown in the output are those obtained after backfitting.
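For reference, here is a minimal sketch of reproducing this kind of output with Weka's Java API. It assumes housing.arff sits in the working directory and that the class attribute is the last one; setNumFolds(3) is REPTree's default, under which one of three folds is held out as the pruning set:

import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class REPTreeDemo {
    public static void main(String[] args) throws Exception {
        // Load the data and mark the last attribute as the class to predict
        // (assumes housing.arff is in the working directory)
        Instances data = DataSource.read("housing.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build a REPTree; with pruning on, the training data is split
        // internally into a growing set and a pruning set
        REPTree tree = new REPTree();
        tree.setNumFolds(3); // default: 1 of 3 folds becomes the pruning set
        tree.buildClassifier(data);

        // toString() prints the tree, including the (A/B) [C/D] leaf values
        System.out.println(tree);
    }
}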
Upvotes: 1
Reputation: 6573
For classification of nominal data, https://www.analyticsvidhya.com/blog/2020/03/decision-tree-weka-no-coding/ says these values are artifacts of the REPTree (Reduced Error Pruning Tree) algorithm.
This aligns with @zbicyclist's answer.
Upvotes: 0
Reputation: 696
I've tried to reverse-engineer an answer, and if I get more definitive information I will update this.
I ran a very small tree on a Toyota Corolla dataset (predicting price of a used car). Here is the tree:
Age_08_04 < 32.5
| Weight < 1297.5 : 18033.54 (121/6009564.12) [59/6768951.55]
| Weight >= 1297.5 : 27945.83 (3/10945416.67) [3/22217291.67]
Age_08_04 >= 32.5
| Age_08_04 < 57.5 : 11363.26 (296/2827594.01) [144/2999066.05]
| Age_08_04 >= 57.5 : 8636.94 (537/1487597.91) [273/1821232.47]
The first numbers in the leaf nodes (18033, 27945, 11363, 8636) are the predicted prices for these cars. The second and fourth numbers add up to the number of instances: 121 + 59 + 3 + 3 + ... + 273 = 1436, the number of instances in the entire set. The second numbers add up to 957 (two-thirds of the instances) and the fourth numbers add up to 479 (one-third of the instances), as the sketch below verifies.
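As a quick check on those sums, here is a small sketch with the leaf counts from the tree above hard-coded:

public class LeafCountCheck {
    public static void main(String[] args) {
        int[] grow  = {121, 3, 296, 537}; // second numbers (growing set)
        int[] prune = {59, 3, 144, 273};  // fourth numbers (pruning set)
        int g = 0, p = 0;
        for (int v : grow)  g += v;
        for (int v : prune) p += v;
        System.out.println(g);     // 957  -- about two-thirds
        System.out.println(p);     // 479  -- about one-third
        System.out.println(g + p); // 1436 -- the whole dataset
    }
}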
Witten et al.'s book (Data Mining: Practical Machine Learning Tools and Techniques, 4th edition), in section 6.1 (Decision Trees: Estimating Error Rates), notes:
"One way of coming up with an error estimate is the standard verification technique: hold back some of the data originally given and use it as an independent test set to estimate the error at each node. This is called reduced-error pruning." (Kindle location 5403)
So I think it's doing that 2/3, 1/3 split on the data, even though we're also doing 10-fold cross-validation.
The third and fifth numbers (after the /) seem to be MSEs. With a bit of algebra, the weighted average of the fifth numbers is consistent with the Root mean squared error and Root relative squared error reported in the cross-validation summary. (Not quite exact, but I don't think I'd expect it to be.) The arithmetic is sketched below.
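To make that check concrete, here is a sketch of the arithmetic, with the pruning-set counts (fourth numbers) and pruning-set MSEs (fifth numbers) hard-coded from the tree above; treating the counts as the weights is my assumption:

public class WeightedRmseCheck {
    public static void main(String[] args) {
        int[]    n   = {59, 3, 144, 273};                                 // pruning-set counts
        double[] mse = {6768951.55, 22217291.67, 2999066.05, 1821232.47}; // pruning-set MSEs
        double weighted = 0;
        int total = 0;
        for (int i = 0; i < n.length; i++) {
            weighted += n[i] * mse[i];
            total    += n[i];
        }
        // Square root of the count-weighted mean MSE; roughly 1707 for these
        // numbers, which is what should line up (approximately) with the
        // reported Root mean squared error
        System.out.println(Math.sqrt(weighted / total));
    }
}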
Again, if I find out more information I will update this answer -- and I'd be happy to get more definitive information from others.
Upvotes: 2