How to get a vector with the response variable from gbm corresponding to the training fraction?

I am running the gbm function (from the gbm R package) with the option train.fraction set to 0.7. I would like to get a vector with the response variable corresponding to this subset. I thought this would be saved in one of the components of the output gbm object, but I haven't found it and I don't know if there is a way to get it. The data fraction used is saved in gbm.result$data$x.ordered, but it does not include the response variable. Apologies if this has a very obvious answer.

Upvotes: 0

Views: 115

Answers (1)

StupidWolf

Reputation: 46978

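gbm uses the first 0.7 * nrow(data) rows of your data for training if you specify train.fraction = 0.7, so the matching response vector is simply the response column for those rows.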

From the help page of the gbm function:

train.fraction: The first ‘train.fraction * nrows(data)’ observations
          are used to fit the ‘gbm’ and the remainder are used for
          computing out-of-sample estimates of the loss function.
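Given that description, the response for the training subset can be sliced out of the original data frame. The following is a minimal sketch, not the package's own code: the names train_fraction, n_train, y_train and y_valid are made up for illustration, and the use of floor() to turn the fraction into a row count is an assumption that should be checked against your gbm version when train.fraction * nrow(data) is not a whole number.

```r
set.seed(111)
data <- iris[sample(nrow(iris)), ]                        # shuffle rows first
data$Species <- as.numeric(data$Species == "versicolor")  # 0/1 response

train_fraction <- 0.7                                     # same value passed to gbm()
n_train <- floor(train_fraction * nrow(data))             # assumed rounding rule

y_train <- data$Species[seq_len(n_train)]                 # response for the training rows
y_valid <- data$Species[-seq_len(n_train)]                # response for the validation rows
```

Because the split is positional, any shuffling has to happen before the call to gbm, and the same row order must be used when slicing the response.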

We can verify this by checking the training and validation errors stored in the fitted object:

train.error: a vector of length equal to the number of fitted trees
          containing the value of the loss function for each boosting
          iteration evaluated on the training data

valid.error: a vector of length equal to the number of fitted trees
          containing the value of the loss function for each boosting
          iteration evaluated on the validation data

For example:

library(gbm)
set.seed(111)
data = iris[sample(nrow(iris)),]
data$Species=as.numeric(data$Species=="versicolor")

fit = gbm(Species ~ .,data=data,train.fraction=0.7,distribution="bernoulli")

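Since 0.7*150 = 105, the first 105 rows are used for training and rows 106 to 150 for validation. We will write a function to calculate the Bernoulli deviance (can refer to this for the derivation) and check it against the stored error for each subset: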

# here y is the observed label, 0 or 1
# P is the log-odds obtained from predict.gbm(..)

b_dev = function(y,P){-2*mean(y*P-log(1+exp(P)))}

fit$train.error[length(fit$train.error)]
[1] 0.1408239

b_dev(data$Species[1:105],predict(fit,data[1:105,],n.trees=fit$n.trees))
[1] 0.1408239

fit$valid.error[length(fit$valid.error)]
[1] 0.365474
b_dev(data$Species[106:150],predict(fit,data[106:150,],n.trees=fit$n.trees))
[1] 0.365474
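As a sanity check on the b_dev helper used above, the log-odds form can be compared with the more familiar form of the Bernoulli deviance written in terms of probabilities. This is only a sketch on made-up labels and log-odds (no fitted model involved); b_dev_prob is a hypothetical name introduced here, and plogis is the base R logistic function.

```r
# Bernoulli deviance from log-odds P, as defined in the answer
b_dev <- function(y, P) -2 * mean(y * P - log(1 + exp(P)))

# Same quantity written with probabilities p = 1 / (1 + exp(-P))
b_dev_prob <- function(y, p) -2 * mean(y * log(p) + (1 - y) * log(1 - p))

set.seed(1)
y <- rbinom(20, 1, 0.5)   # made-up 0/1 labels
P <- rnorm(20)            # made-up log-odds

# The two forms should agree up to floating point error
all.equal(b_dev(y, P), b_dev_prob(y, plogis(P)))
```

The two agree because y*P - log(1 + exp(P)) equals y*log(p) + (1 - y)*log(1 - p) when p = plogis(P).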

Upvotes: 1
