grammar

Reputation: 939

Weka - Classification and Regression Trees

I have run Classification and Regression Trees (trees.REPTree) on the housing.arff data (with 66% Percentage Split). This is the outcome.

REPTree
============

RM < 6.84
|   LSTAT < 14.8
|   |   LSTAT < 9.75 : 25.15 (88/21.02) [47/55.38]

What do the values at the leaves (25.15, 88/21.02, etc.) mean?

Upvotes: 2

Views: 1292

Answers (3)

fracpete

Reputation: 2608

For the sake of completeness, here is a copy of Eibe Frank's answer from the Weka mailing list (dated 2015/01/21):

Remember that REPTree splits the data into a growing set and a pruning set (unless you turn pruning off).

Let’s say you have

  (A/B) [C/D]

The meaning of this expression depends on whether you are doing regression (your case) or classification.

Regression case

  • A: total weight of all instances in the growing set that end up in this leaf
  • B: average squared error for all instances in the growing set that end up in this leaf (taking instance weights into account)
  • C: total weight of all instances in the pruning set that end up in this leaf
  • D: average squared error for all instances in the pruning set that end up in this leaf (taking instance weights into account)

Classification case

  • A: same as A above
  • B: total weight of all incorrectly classified instances in the growing set that end up in this leaf
  • C: same as C above
  • D: total weight of all incorrectly classified instances in the pruning set that end up in this leaf

The error will normally be larger on the pruning set than on the growing set, as in your case.

Note that A, B, C, and D are calculated before backfitting, which is the last step in the REPTree algorithm that happens after growing and pruning the tree. During backfitting, the data from the pruning set is used to update the predictions made at the leaf nodes, so that these are based on the full, combined data.

The predictions at the leaf nodes shown in the output are those obtained after backfitting.
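
To make this concrete, here is a small sketch (my illustration, not part of the quoted answer and not the Weka API) that pulls the five numbers out of the regression leaf line from the question; the regex and the helper name parse_leaf are made up for this example:

  import re

  # Hypothetical helper: parse one REPTree leaf line of the form
  # "... : PRED (A/B) [C/D]" into its five numbers.
  LEAF = re.compile(
      r":\s*(?P<pred>[-\d.]+)\s*"
      r"\((?P<A>[-\d.]+)/(?P<B>[-\d.]+)\)\s*"
      r"\[(?P<C>[-\d.]+)/(?P<D>[-\d.]+)\]"
  )

  def parse_leaf(line):
      return {k: float(v) for k, v in LEAF.search(line).groupdict().items()}

  leaf = parse_leaf("|   |   LSTAT < 9.75 : 25.15 (88/21.02) [47/55.38]")
  print(leaf)  # {'pred': 25.15, 'A': 88.0, 'B': 21.02, 'C': 47.0, 'D': 55.38}

Read in the regression sense: 88 growing-set instances (by weight) reached this leaf with an average squared error of 21.02, and 47 pruning-set instances reached it with an average squared error of 55.38.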

Upvotes: 1

Glenn

Reputation: 6573

For classifying nominal data, https://www.analyticsvidhya.com/blog/2020/03/decision-tree-weka-no-coding/ explains these values as output of the REPTree (Reduced Error Pruning Tree) algorithm:

  • The value before the parenthesis denotes the classification value
  • The first value in the first parenthesis is the total number of instances from the training set in that leaf. The second value is the number of instances incorrectly classified in that leaf
  • The first value in the second parenthesis is the total number of instances from the pruning set in that leaf. The second value is the number of instances incorrectly classified in that leaf

This aligns with @zbicyclist's answer.
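
As a made-up illustration of that reading, take a hypothetical nominal leaf printed as "good (50/5) [25/4]" (the class label and all four numbers are invented for this example):

  # Hypothetical classification leaf: "good (50/5) [25/4]"
  grow_total, grow_wrong = 50, 5     # training/growing set: 50 instances, 5 misclassified
  prune_total, prune_wrong = 25, 4   # pruning set: 25 instances, 4 misclassified

  print(f"growing-set error rate: {grow_wrong / grow_total:.1%}")    # 10.0%
  print(f"pruning-set error rate: {prune_wrong / prune_total:.1%}")  # 16.0%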

Upvotes: 0

zbicyclist

Reputation: 696

I've tried to reverse-engineer an answer, and if I get more definitive information I will update this.

I ran a very small tree on a Toyota Corolla dataset (predicting the price of a used car). Here is the tree:

Age_08_04 < 32.5
|   Weight < 1297.5 : 18033.54 (121/6009564.12) [59/6768951.55]
|   Weight >= 1297.5 : 27945.83 (3/10945416.67) [3/22217291.67]
Age_08_04 >= 32.5
|   Age_08_04 < 57.5 : 11363.26 (296/2827594.01) [144/2999066.05]
|   Age_08_04 >= 57.5 : 8636.94 (537/1487597.91) [273/1821232.47]

The first numbers in the leaf nodes (18033, 27945, 11363, 8636) are the predicted prices for these cars. The second and fourth numbers add up to the total number of instances: 121 + 59 + 3 + 3 + ... + 273 = 1436, the size of the entire set. The second numbers add up to 957 (two-thirds of the instances) and the fourth numbers add up to 479 (one-third of the instances).
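
That arithmetic is easy to re-check (plain Python, just re-adding the numbers from the tree above):

  # (growing-set count, pruning-set count) for each leaf of the tree above
  leaves = [(121, 59), (3, 3), (296, 144), (537, 273)]

  grow  = sum(g for g, p in leaves)       # 957
  prune = sum(p for g, p in leaves)       # 479
  print(grow, prune, grow + prune)        # 957 479 1436
  print(round(grow / (grow + prune), 3))  # 0.666 -- i.e., two-thirds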

Witten et al.'s book (Data Mining: Practical Machine Learning Tools and Techniques, 4th edition), in section 6.1 (Decision Trees: Estimating Error Rates), notes:

"One way of coming up with an error estimate is the standard verification technique: hold back some of the data originally given and use it as an independent test set to estimate the error at each node. This is called reduced-error pruning." (Kindle location 5403)

So I think REPTree is doing that 2/3 : 1/3 split on the data internally, even though we're also doing 10-fold cross-validation.

The third and fifth numbers (after the /) appear to be MSEs (mean squared errors). A bit of algebra shows that the weighted average of the fifth numbers is consistent with the root mean squared error and root relative squared error reported in the cross-validation summary (not quite exact, but I don't think I'd expect it to be).
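
To spell out that algebra (plain Python again; I haven't reproduced the cross-validation summary values themselves here), the pruning-set-weighted average of the leaf MSEs yields an RMSE of roughly 1707:

  import math

  # (pruning-set count, pruning-set MSE) for each leaf of the tree above
  prune = [(59, 6768951.55), (3, 22217291.67),
           (144, 2999066.05), (273, 1821232.47)]

  n = sum(c for c, mse in prune)                       # 479
  weighted_mse = sum(c * mse for c, mse in prune) / n  # ~2.91e6
  print(round(math.sqrt(weighted_mse), 1))             # ~1706.6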

Again, if I find out more information I will update this answer -- and I'd be happy to get more definitive information from others.

Upvotes: 2
