goldisfine

Reputation: 4850

How to get terminal nodes for a new observation from an rpart object?

Say I have

library(rpart)  # provides rpart() and the kyphosis data set
head(kyphosis)
inTrain <- sample(1:nrow(kyphosis), 45, replace = FALSE)
TRAIN_KYPHOSIS <- kyphosis[inTrain,]
TEST_KYPHOSIS <- kyphosis[-inTrain,]

(kyph_tree <- rpart(Number ~ ., data = TRAIN_KYPHOSIS))

How to get the terminal node from the fitted object for each observation in TEST_KYPHOSIS?

How do I get a summary, such as the deviance and the predicted value from the terminal node which each test observation maps to?

Upvotes: 6

Views: 2842

Answers (2)

VitoshKa

Reputation: 8533

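rpart already implements this internally, but the functionality is not exported (strangely enough, since it's a rather obvious requirement). The wrapper below reaches the internal functions via rpart:::, so it may break if the package internals change.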

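# Return the node number (as labelled in object$frame) for each observation.
predict_nodes <-
    function (object, newdata, na.action = na.pass) {
        where <-
            if (missing(newdata)) 
                object$where  # training data: node positions are stored in the fit
            else {
                if (is.null(attr(newdata, "terms"))) {
                    # Build a model frame that contains the predictors only
                    Terms <- delete.response(object$terms)
                    newdata <- model.frame(Terms, newdata, na.action = na.action, 
                                           xlev = attr(object, "xlevels"))
                    if (!is.null(cl <- attr(Terms, "dataClasses"))) 
                        .checkMFClasses(cl, newdata, TRUE)
                }
                # Unexported rpart internals: drop each row down the tree
                rpart:::pred.rpart(object, rpart:::rpart.matrix(newdata))
            }
        # 'where' indexes rows of object$frame; map those to node numbers
        as.integer(row.names(object$frame))[where]
    }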

And then:

> predict_nodes(kyph_tree, TEST_KYPHOSIS)
 [1] 5 3 4 3 3 5 5 3 3 3 3 5 5 4 3 5 4 3 3 3 3 4 3 4 4 5 5 3 4 4 3 5 3 5 5 5
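Once you have node numbers, the per-node deviance and predicted value asked about in the question can be read from the fit's frame component, whose row names are the node numbers. A minimal sketch, assuming a regression tree; the names fit, nodes and node_summary are only illustrative, and the training rows are used here so that the snippet runs on its own. A vector returned by predict_nodes() for new data can be matched in the same way.

```r
library(rpart)

set.seed(1)
fit <- rpart(Number ~ ., data = kyphosis)

# Node numbers of the training rows (same kind of vector predict_nodes() returns)
nodes <- as.integer(row.names(fit$frame))[fit$where]

# Match node numbers to rows of the frame by position, then pull the
# node size (n), deviance (dev) and fitted value (yval) for each observation
idx <- match(nodes, as.integer(row.names(fit$frame)))
node_summary <- fit$frame[idx, c("n", "dev", "yval")]
node_summary$node <- nodes

head(node_summary)
```

Matching on the integer node numbers rather than indexing by character row names avoids the partial matching of row names that data frames perform.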

Upvotes: 8

Achim Zeileis

Reputation: 17193

One option is to convert the rpart object to an object of class party from the partykit package. That provides a general toolkit for dealing with recursive partytions. The conversion is simple:

library("partykit")
(kyph_party <- as.party(kyph_tree))

Model formula:
Number ~ Kyphosis + Age + Start

Fitted party:
[1] root
|   [2] Start >= 15.5: 2.933 (n = 15, err = 10.9)
|   [3] Start < 15.5
|   |   [4] Age >= 112.5: 3.714 (n = 14, err = 18.9)
|   |   [5] Age < 112.5: 5.125 (n = 16, err = 29.8)

Number of inner nodes:    2
Number of terminal nodes: 3

(For exact reproducibility run the code from your question with set.seed(1) prior to running my code.)

For objects of this class there are somewhat more flexible methods for plot(), predict(), fitted(), etc. For example, plot(kyph_party) yields a more informative display than the default plot(kyph_tree). The fitted() method extracts a two-column data.frame with the fitted node numbers and the observed responses on the training data.

kyph_fit <- fitted(kyph_party)
head(kyph_fit, 3)

  (fitted) (response)
1        5          6
2        2          2
3        4          3

With this you can easily compute any quantity you are interested in, e.g., the mean, median, or residual sum of squares within each node.

tapply(kyph_fit[,2], kyph_fit[,1], mean)

       2        4        5 
2.933333 3.714286 5.125000 

tapply(kyph_fit[,2], kyph_fit[,1], median)

2 4 5 
3 4 5 

tapply(kyph_fit[,2], kyph_fit[,1], function(x) sum((x - mean(x))^2))

       2        4        5 
10.93333 18.85714 29.75000 

Instead of the simple tapply() you can use any other function of your choice to compute the tables of grouped statistics.
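As one possible alternative to separate tapply() calls, the statistics can be collected into a single table with split() and lapply(). This is only a sketch: the tree is refitted on the full kyphosis data so that the snippet runs on its own, and the names by_node and node_stats are made up for illustration.

```r
library(rpart)
library(partykit)

kyph_party <- as.party(rpart(Number ~ ., data = kyphosis))
kyph_fit <- fitted(kyph_party)  # column 1: node id, column 2: observed response

# Group the responses by terminal node, then compute several statistics per group
by_node <- split(kyph_fit[, 2], kyph_fit[, 1])
node_stats <- do.call(rbind, lapply(by_node, function(y) {
  c(n = length(y),
    mean = mean(y),
    median = median(y),
    rss = sum((y - mean(y))^2))
}))

node_stats  # one row per terminal node, row names are the node ids
```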

Now, to learn which node in the tree each observation from the test data TEST_KYPHOSIS falls into, you can simply use the predict(..., type = "node") method:

kyph_pred <- predict(kyph_party, newdata = TEST_KYPHOSIS, type = "node")
head(kyph_pred)

 2  3  4  6  7 10 
 4  4  5  2  2  5 
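Putting the two pieces together, the predicted node ids can be used to look up the training-based statistics for each test observation. The following is a sketch under the setup from the question; the lower-case names (train, test, node_mean, node_rss, test_summary) are illustrative and not part of either package, and the exact numbers depend on the random split.

```r
library(rpart)
library(partykit)

set.seed(1)
in_train <- sample(seq_len(nrow(kyphosis)), 45)
train <- kyphosis[in_train, ]
test  <- kyphosis[-in_train, ]

kyph_tree  <- rpart(Number ~ ., data = train)
kyph_party <- as.party(kyph_tree)

# Per-node mean and residual sum of squares from the training data
kyph_fit  <- fitted(kyph_party)
node_mean <- tapply(kyph_fit[, 2], kyph_fit[, 1], mean)
node_rss  <- tapply(kyph_fit[, 2], kyph_fit[, 1], function(y) sum((y - mean(y))^2))

# Terminal node of each test observation, used as a lookup key
test_node <- predict(kyph_party, newdata = test, type = "node")
key <- as.character(test_node)

test_summary <- data.frame(
  node      = as.integer(test_node),
  predicted = as.numeric(node_mean[key]),
  deviance  = as.numeric(node_rss[key]),
  row.names = rownames(test)
)
head(test_summary)
```

For a regression tree like this one, the predicted column should agree with predict(kyph_tree, newdata = test), since the fitted value of a terminal node is the mean response of the training observations in it.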

Upvotes: 5
