Random Forest Regression: Extracting the training samples in the terminal nodes of each tree

Question

I want to implement the Predictive Prescription approach from Bertsimas et al. (2020), which combines machine learning with optimization. For that, I need to look at the terminal nodes (disjoint regions) of each decision tree in the forest.

Specifically, I want to know the following things for each tree:

In which region do the training samples fall?
To which region do the test samples belong?

I hope my question becomes clearer with the following picture of one decision tree:

Regression Tree Example

Here, for the first terminal node, I am not interested in the prediction m but rather in the values y1, y4 and y5 that form the basis for the prediction.

The ideal result would be a matrix-like structure in which each column represents one tree and each row represents one training (or test) sample. Each entry should give the ID of the region/terminal node that the sample falls into in that tree.

I looked at the randomForest and ranger packages but had no luck finding anything relevant. A paper mentioned implementing the method with the caret package, but the authors did not explain how to bypass the prediction.

Here's a reproducible regression example using ranger:

library(MASS)    # provides the Boston housing data
library(ranger)  # random forest implementation

#load data
data(Boston)
set.seed(111)
ind <- sample(2, nrow(Boston), replace = TRUE, prob=c(0.8, 0.2))

train <- Boston[ind == 1,]
test <- Boston[ind == 2,]

#train random forest
boston.rf <- ranger(medv ~ ., data = train)

Any help is highly appreciated. Cheers!
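
One possible route, sketched here rather than offered as a definitive answer: ranger's predict method accepts type = "terminalNodes", which returns a matrix of terminal node IDs with one row per sample and one column per tree. The names train.nodes, test.nodes, tree, leaf and y.in.leaf below are made up for illustration, and the sketch assumes the MASS and ranger packages are installed.

```r
library(MASS)    # Boston housing data
library(ranger)

# Same setup as in the question
data(Boston)
set.seed(111)
ind <- sample(2, nrow(Boston), replace = TRUE, prob = c(0.8, 0.2))
train <- Boston[ind == 1, ]
test  <- Boston[ind == 2, ]

boston.rf <- ranger(medv ~ ., data = train)

# Terminal node ID of every sample in every tree
# (rows = samples, columns = trees)
train.nodes <- predict(boston.rf, data = train, type = "terminalNodes")$predictions
test.nodes  <- predict(boston.rf, data = test,  type = "terminalNodes")$predictions

dim(train.nodes)  # nrow(train) rows, boston.rf$num.trees columns

# Example: training responses that share a leaf with the first
# test sample in the first tree
tree <- 1
leaf <- test.nodes[1, tree]
y.in.leaf <- train$medv[train.nodes[, tree] == leaf]
```

Note that each tree is grown on a bootstrap sample, so y.in.leaf also contains training samples that were out-of-bag for that tree, and its mean need not equal the tree's prediction. If only the in-bag samples are wanted, fitting with keep.inbag = TRUE should make the per-tree counts available in boston.rf$inbag.counts, which could be used to filter or weight the rows.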
