Reputation: 11
I am running the gbm function (from the gbm R package) with the option train.fraction set to 0.7. I would like to get a vector with the response variable corresponding to this training subset. I thought it would be saved in one of the components of the output gbm object, but I haven't found it and I don't know if there is a way to get it. The data fraction used is saved in gbm.result$data$x.ordered, but that does not include the response variable. Apologies if this has a very obvious answer.
Upvotes: 0
Views: 115
Reputation: 46978
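If you specify train.fraction = 0.7, gbm trains on the first 0.7 * nrow(data) rows of your data, in the order they appear, so the response for that subset is simply the first rows of your response column.
From the help page of the gbm function: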
train.fraction: The first ‘train.fraction * nrows(data)’ observations
are used to fit the ‘gbm’ and the remainder are used for
computing out-of-sample estimates of the loss function.
We can verify this by checking the training error and the validation error stored in the fitted object:
train.error: a vector of length equal to the number of fitted trees
containing the value of the loss function for each boosting
iteration evaluated on the training data
valid.error: a vector of length equal to the number of fitted trees
containing the value of the loss function for each boosting
iteration evaluated on the validation data
For example:
library(gbm)
set.seed(111)
data = iris[sample(nrow(iris)),]
data$Species=as.numeric(data$Species=="versicolor")
fit = gbm(Species ~ .,data=data,train.fraction=0.7,distribution="bernoulli")
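Since 0.7*150 = 105, rows 1 to 105 should be the training set and rows 106 to 150 the validation set. We will write a function to calculate the Bernoulli deviance (can refer to this for the derivation) and check that it reproduces the respective errors on each subset: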
# here y is the observed label, 0 or 1
# P is the log-odds obtained from predict.gbm(..)
b_dev = function(y,P){-2*mean(y*P-log(1+exp(P)))}
fit$train.error[length(fit$train.error)]
[1] 0.1408239
b_dev(data$Species[1:105],predict(fit,data[1:105,],n.trees=fit$n.trees))
[1] 0.1408239
fit$valid.error[length(fit$valid.error)]
[1] 0.365474
b_dev(data$Species[106:150],predict(fit,data[106:150,],n.trees=fit$n.trees))
[1] 0.365474
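Given that the split is positional, the response vector for the training subset can be recovered by indexing the original data directly. The following is a minimal sketch under that assumption, using the same shuffled iris data as above. The names train_fraction, n_train, y_train and y_valid are only illustrative, and the use of floor() for a non-integer product is my assumption about how gbm rounds (here the product is exactly 105, so it does not matter).

```r
# Rebuild the same shuffled data used in the example above
set.seed(111)
data <- iris[sample(nrow(iris)), ]
data$Species <- as.numeric(data$Species == "versicolor")

train_fraction <- 0.7

# Assumption: gbm uses the first floor(train.fraction * nrow(data)) rows
n_train <- floor(train_fraction * nrow(data))

# Response for the training subset (first n_train rows)
y_train <- data$Species[seq_len(n_train)]

# Response for the validation subset (remaining rows)
y_valid <- data$Species[(n_train + 1):nrow(data)]

length(y_train)
length(y_valid)
```

If the fitted object exposes the number of training rows (I believe it is stored as fit$nTrain, but please confirm with names(fit) for your gbm version), that value could be used in place of n_train to avoid relying on the rounding assumption.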
Upvotes: 1