cmac

Reputation: 127

How to obtain tree-level, out-of-bag predictions from a random forest using the ranger package?

For each observation in the data used to train a random forest, there is a set of trees (roughly 1/3 of the trees in the forest) for which that observation was out-of-bag. I would like a measure of the spread of these out-of-bag, tree-level predictions for each observation, ideally by retrieving the prediction from each such tree.

Is there a way to do this for random forest models fit using the ranger package in R?

library(ranger)
data("iris")

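set.seed(1)  # make the train/holdout split reproducible

# Row indices of the training set (80%) and of the held-out rows
iris_train <- sample(seq_len(nrow(iris)), size=floor(nrow(iris)*0.8))
new_data <- setdiff(seq_len(nrow(iris)), iris_train)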

rf <- ranger::ranger(formula=Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
                     data=iris[iris_train,])
# OOB predictions (average only):
rf$predictions

Note that for new data, it is possible to get tree-level predictions from a random forest model using predict.ranger(..., predict.all=TRUE). I do not see such an option for returning in-sample but out-of-bag tree-level predictions.

# New data predictions (all trees):
p <- predict(rf, iris[new_data,], predict.all = TRUE)

Upvotes: 1

Views: 597

Answers (1)

cmac

Reputation: 127

Set keep.inbag=TRUE when fitting the forest, so that ranger stores the in-bag counts for every tree.

rf <- ranger::ranger(formula=Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
                     data=iris[iris_train,],
                     keep.inbag=TRUE)

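rf$inbag.counts is a list with one element per tree; each element is a vector giving how many times each training observation was drawn for that tree (0 means the observation was out-of-bag). We can use this to mask the in-bag entries in the matrix of tree-level predictions for the training data.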

ibcs <- rf$inbag.counts

# Convert to an n-observation by n-trees matrix:
ibcs <- do.call(cbind, ibcs)


# Get predictions from all trees
preds <- predict(rf, iris[iris_train,], predict.all = TRUE)$predictions

# Set in-bag predictions to NA; preds and ibcs are both
# n-observation by n-trees, so the logical mask lines up element-wise
preds[ibcs > 0] <- NA

Check that averaging the remaining (out-of-bag) predictions across each row reproduces rf$predictions:

all.equal(rf$predictions, rowMeans(preds, na.rm=TRUE))
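
Since the question asks for a measure of spread, the same masking idea can be wrapped in a small helper that returns a per-observation standard deviation. This is only a sketch: oob_spread is a hypothetical name, not a ranger function, and the toy matrices below are made-up numbers used purely to show the shape of the inputs.

```r
# Hypothetical helper (not part of ranger): per-observation spread of
# out-of-bag, tree-level predictions.
#   tree_preds:   n x num.trees numeric matrix of per-tree predictions
#   inbag_counts: n x num.trees matrix of in-bag counts (0 = out-of-bag)
oob_spread <- function(tree_preds, inbag_counts) {
  stopifnot(identical(dim(tree_preds), dim(inbag_counts)))
  oob <- tree_preds
  oob[inbag_counts > 0] <- NA  # mask predictions from trees that saw the row
  data.frame(
    n_oob = rowSums(!is.na(oob)),                    # number of OOB trees
    mean  = rowMeans(oob, na.rm = TRUE),             # OOB mean prediction
    sd    = apply(oob, 1, stats::sd, na.rm = TRUE)   # OOB standard deviation
  )
}

# Toy input: 3 observations, 4 trees (illustrative values only)
tree_preds <- matrix(c(5.0, 5.2, 4.8, 5.1,
                       6.0, 6.4, 6.2, 5.8,
                       4.5, 4.7, 4.6, 4.4),
                     nrow = 3, byrow = TRUE)
inbag_counts <- matrix(c(1, 0, 0, 2,
                         0, 1, 0, 0,
                         2, 0, 1, 0),
                       nrow = 3, byrow = TRUE)
res <- oob_spread(tree_preds, inbag_counts)
print(res)
```

With the forest fitted above, the inputs would be predict(rf, iris[iris_train,], predict.all = TRUE)$predictions and do.call(cbind, rf$inbag.counts). An observation that is out-of-bag for fewer than two trees gets an NA standard deviation, so the n_oob column is worth checking when the forest is small.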

Upvotes: 1
