screechOwl
screechOwl

Reputation: 28129

R gbm logistic regression

I was hoping to use the GBM package to do logistic regression, but it is giving answers slightly outside of the 0-1 range. I've tried the suggested distribution parameters for 0-1 predictions (bernoulli, and adaboost) but that actually makes things worse than using gaussian.

GBM_NTREES = 150
GBM_SHRINKAGE = 0.1
GBM_DEPTH = 4
GBM_MINOBS = 50
> GBM_model <- gbm.fit(
+ x = trainDescr 
+ ,y = trainClass 
+ ,distribution = "gaussian"
+ ,n.trees = GBM_NTREES
+ ,shrinkage = GBM_SHRINKAGE
+ ,interaction.depth = GBM_DEPTH
+ ,n.minobsinnode = GBM_MINOBS
+ ,verbose = TRUE)
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        0.0603             nan     0.1000    0.0019
     2        0.0588             nan     0.1000    0.0016
     3        0.0575             nan     0.1000    0.0013
     4        0.0563             nan     0.1000    0.0011
     5        0.0553             nan     0.1000    0.0010
     6        0.0546             nan     0.1000    0.0008
     7        0.0539             nan     0.1000    0.0007
     8        0.0533             nan     0.1000    0.0006
     9        0.0528             nan     0.1000    0.0005
    10        0.0524             nan     0.1000    0.0004
   100        0.0484             nan     0.1000    0.0000
   150        0.0481             nan     0.1000   -0.0000
> prediction <- predict.gbm(object = GBM_model
+ ,newdata = testDescr
+ ,GBM_NTREES)
> hist(prediction)
> range(prediction)
[1] -0.02945224  1.00706700

Bernoulli:

GBM_model <- gbm.fit(
x = trainDescr 
,y = trainClass 
,distribution = "bernoulli"
,n.trees = GBM_NTREES
,shrinkage = GBM_SHRINKAGE
,interaction.depth = GBM_DEPTH
,n.minobsinnode = GBM_MINOBS
,verbose = TRUE)
prediction <- predict.gbm(object = GBM_model
+ ,newdata = testDescr
+ ,GBM_NTREES)
> hist(prediction)
> range(prediction)
[1] -4.699324  3.043440

And adaboost:

GBM_model <- gbm.fit(
x = trainDescr 
,y = trainClass 
,distribution = "adaboost"
,n.trees = GBM_NTREES
,shrinkage = GBM_SHRINKAGE
,interaction.depth = GBM_DEPTH
,n.minobsinnode = GBM_MINOBS
,verbose = TRUE)
> prediction <- predict.gbm(object = GBM_model
+ ,newdata = testDescr
+ ,GBM_NTREES)
> hist(prediction)
> range(prediction)
[1] -3.0374228  0.9323279

Am I doing something wrong, do I need to preProcess (scale, center) the data or do I need to go in and manually floor/cap the values with something like :

prediction <- ifelse(prediction < 0, 0, prediction)
prediction <- ifelse(prediction > 1, 1, prediction)

Upvotes: 8

Views: 13428

Answers (1)

Hong Ooi
Hong Ooi

Reputation: 57686

From ?predict.gbm:

Returns a vector of predictions. By default the predictions are on the scale of f(x). For example, for the Bernoulli loss the returned value is on the log odds scale, poisson loss on the log scale, and coxph is on the log hazard scale.

If type="response" then gbm converts back to the same scale as the outcome. Currently the only effect this will have is returning probabilities for bernoulli and expected counts for poisson. For the other distributions "response" and "link" return the same.

So if you use distribution="bernoulli", you need to transform the predicted values to rescale them to [0, 1]: p <- plogis(predict.gbm(model)). Using distribution="gaussian" is really for regression as opposed to classification, although I'm surprised that the predictions aren't in [0, 1]: my understanding is that gbm is still based on trees, so the predicted values shouldn't be able to go outside the values present in the training data.

Upvotes: 15

Related Questions