goldisfine
goldisfine

Reputation: 4850

How to generate a prediction interval from a regression tree rpart object?

How do you generate a prediction interval from a regression tree that is fit using rpart?

It is my understanding that a regression tree models the response conditional on the mean of the leaf nodes. I don't know how to get the variance for a leaf node from the model, but what I would like to do is simulate using the mean and variance for a leaf node to obtain a prediction interval.

Predict.rpart() doesn't give an option for interval.

Example: I fit a tree with iris data, but predict doesn't have an option, "interval"

> r1 <- rpart(Sepal.Length ~ ., cp = 0.001, data = iris[1:nrow(iris)-1,])
> predict(r1,newdata=iris[nrow(iris),],type = "interval")
Error in match.arg(type) : 
  'arg' should be one of “vector”, “prob”, “class”, “matrix”

Upvotes: 12

Views: 7347

Answers (2)

cmbarbu
cmbarbu

Reputation: 4534

It is not clear to me what confidence intervals would mean for regression trees as those are not classical statistical models like linear models. And I see mainly two uses: characterising the certainty of your tree or characterizing the precision of the prediction for each leaf of the tree. Hereafter an answer for each of these possibilities.

Characterizing the certainty of your tree

If you are looking for a confidence value for a split node, then party provides that directly as it uses permutation tests and statistically determine which variables are most important and the p-value attached to each split. A significant superiority of party's ctree function over rpart as explained here.

Confidence intervals for set leafs of the regression tree

Third, if you are looking for a confidence of interval for the value in each leaf, then the [0.025,0.975] quantiles interval for the observations in the leaf is most likely what you are looking for. The default plots in party takes a similar approach when displaying boxplots for the output value for each leaf:

library("party")
r2 <- ctree(Sepal.Length ~ .,data=iris)
plot(r2)    

example party tree

Retrieving the corresponding intervals can simply be done by:

iris$leaf <- predict(r2, type="node")
CIleaf <- aggregate(iris$Sepal.Length,
                    by=list(leaf=iris$leaf),
                    quantile,
                    prob=c(0.025, 0.25, 0.75, 0.975))

And it's easy to visualize:

plot(as.factor(CIleaf$leaf), CIleaf[, 2],
     ylab="Sepal length", xlab="Regression tree leaf")
legend("bottomright",
       c(" 0.975 quantile", " 0.75 quantile", " mean", 
         " 0.25 quantile", " 0.025 quantile"),
       pch=c("-", "_", "_", "_", "-"),
       pt.lwd=0.5, pt.cex=c(1, 1, 2, 1, 1), xjust=1)

Sepal length variance per regression tree leaf

Upvotes: 13

Jeff
Jeff

Reputation: 738

Perhaps one option is a simple bootstrap of your training data?

library(rpart)
library(boot)

trainData <- iris[-150L, ]
predictData <- iris[150L, ]

rboot <- boot(trainData, function(data, idx) {
            bootstrapData <- data[idx, ]
            r1 <- rpart(Sepal.Length ~ ., bootstrapData, cp = 0.001)
            predict(r1, newdata = predictData)
        }, 1000L)

quantile(rboot$t, c(0.025, 0.975))
    2.5%    97.5% 
5.871393 6.766842 

Upvotes: 4

Related Questions