Reputation: 127
For each observation in the data frame used to train a random forest model, there is a set of trees (roughly a third of the forest, about 37% under default bootstrap sampling) for which that observation was out-of-bag. I would like to get a measure of the spread of these out-of-bag, tree-level predictions for each observation, ideally by retrieving the prediction from each tree.
Is there a way to do this for random forest models fit using the ranger package in R?
library(ranger)
data("iris")
iris_train <- sample(1:nrow(iris), size=floor(nrow(iris)*0.8))
new_data <- setdiff(1:nrow(iris), iris_train)
rf <- ranger::ranger(formula=Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
data=iris[iris_train,])
# OOB predictions (average only):
rf$predictions
Note that for new data, it is possible to get tree-level predictions from a random forest model using predict.ranger(..., predict.all=TRUE). I do not see such an option for returning in-sample but out-of-bag tree-level predictions.
# New data predictions (all trees):
p <- predict(rf, iris[new_data,], predict.all = TRUE)
Upvotes: 1
Views: 597
Reputation: 127
The way to do this is to make sure to set keep.inbag=TRUE when running the random forest.
rf <- ranger::ranger(formula=Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
data=iris[iris_train,],
keep.inbag=TRUE)
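The inbag.counts element of the fitted model is a list with one vector per tree, giving how many times each training observation was sampled into that tree. A count of 0 means the observation was out-of-bag for that tree, so we can use these counts to "mask" the tree-level predictions made on the whole training set.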
ibcs <- rf$inbag.counts
# Convert to an n-observation by n-trees matrix:
ibcs <- do.call(cbind, ibcs)
# Get predictions from all trees
preds <- predict(rf, iris[iris_train,], predict.all = TRUE)$predictions
# Set in-bag predictions to NA using the ibcs matrix
preds[which(ibcs > 0)] <- NA
Check that averaging each row's remaining (out-of-bag) predictions reproduces rf$predictions:
all.equal(rf$predictions, rowMeans(preds, na.rm=TRUE))
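With the in-bag entries masked, the per-observation spread asked about in the question can be read off the rows of that matrix. The following is a minimal sketch rather than a definitive recipe: the seed, the names train, inbag, oob_preds and oob_summary, and the choice of standard deviation plus a 5%-95% quantile range as spread measures are all assumptions made here for illustration.

```r
library(ranger)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
train_idx <- sample(seq_len(nrow(iris)), size = floor(nrow(iris) * 0.8))
train <- iris[train_idx, ]

rf <- ranger(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
             data = train,
             keep.inbag = TRUE)

# n-observations by n-trees matrix of in-bag counts
inbag <- do.call(cbind, rf$inbag.counts)

# Tree-level predictions on the training rows, keeping only out-of-bag entries
oob_preds <- predict(rf, train, predict.all = TRUE)$predictions
oob_preds[inbag > 0] <- NA

# Per-observation summaries of the out-of-bag tree predictions
oob_summary <- data.frame(
  n_oob_trees = rowSums(!is.na(oob_preds)),
  oob_mean    = rowMeans(oob_preds, na.rm = TRUE),
  oob_sd      = apply(oob_preds, 1, sd, na.rm = TRUE),
  oob_q05     = apply(oob_preds, 1, quantile, probs = 0.05,
                      na.rm = TRUE, names = FALSE),
  oob_q95     = apply(oob_preds, 1, quantile, probs = 0.95,
                      na.rm = TRUE, names = FALSE)
)

head(oob_summary)
```

The n_oob_trees column is worth checking before trusting the spread columns, since an observation that was out-of-bag for very few trees gives an unstable estimate. Note also that this spread describes disagreement among individual trees; it is not by itself a calibrated prediction interval.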
Upvotes: 1