gbmforgbm

Reputation: 23

Error with mnLogLoss for a multinomial classifier using caret/gbm

I am trying to build a multinomial classifier. Training seems to work, and I can generate a plot of minimized logLoss vs. boosting iterations, but I am having trouble extracting the error value itself. This is the error I get when I run the mnLogLoss function:

Error in mnLogLoss(predicted, lev = predicted$label) : 
  'data' should have columns consistent with 'lev'
The data has been partitioned into:

- training
- testing

In both, the column "label" contains the ground truth.
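For context, a minimal sketch of how such a split could be produced (assuming a hypothetical source data frame full_data with a factor column label; createDataPartition is caret's stratified splitter):

library(caret)

# Hypothetical 70/30 split, stratified on the class label
set.seed(42)
idx <- createDataPartition(full_data$label, p = 0.7, list = FALSE)
training <- full_data[idx, ]
testing  <- full_data[-idx, ]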

library(caret)
library(MLmetrics)
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
                           savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)


gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)

system.time(
  gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
                   verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
)

gbmPredictions <- predict(gbmFit1, testing)
predicted <- cbind(gbmPredictions, testing)

mnLogLoss(predicted, lev = levels(predicted$label))

Upvotes: 1

Views: 186

Answers (1)

StupidWolf

Reputation: 46978

For mnLogLoss, the help page (?mnLogLoss) says:

data: a data frame with columns ‘obs’ and ‘pred’ for the observed
          and predicted outcomes. For metrics that rely on class
          probabilities, such as ‘twoClassSummary’, columns should also
          include predicted probabilities for each class. See the
          ‘classProbs’ argument to ‘trainControl’.
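In other words, the data frame needs an obs factor column plus one probability column named after each class level. A minimal hand-made sketch with toy values (not model output):

library(caret)

# Toy input: obs/pred factors plus one probability column per class level
d <- data.frame(obs  = factor(c("a", "b", "a"), levels = c("a", "b")),
                pred = factor(c("a", "b", "b"), levels = c("a", "b")),
                a = c(0.8, 0.3, 0.4),
                b = c(0.2, 0.7, 0.6))

mnLogLoss(d, lev = levels(d$obs))  # returns a named logLoss value (~0.499 here)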

So it is not asking for your training data. The data argument is simply a data frame of observed and predicted values, so I use some simulated data:

library(caret)

df = data.frame(label = factor(sample(c("a", "b"), 100, replace = TRUE)),
                matrix(runif(5000), ncol = 50))
training = df[1:50, ]
testing = df[51:100, ]

fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
                           savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)

gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)

gbmFit1 <- train(label ~ ., data = training, method = "gbm", trControl = fitControl,
                 verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)

Then we put together obs, pred, and the predicted probability of each class (the last two columns):

predicted <- data.frame(obs = testing$label,
                        pred = predict(gbmFit1, testing),
                        predict(gbmFit1, testing, type = "prob"))

head(predicted)

  obs pred         a         b
1   b    a 0.5506054 0.4493946
2   b    a 0.5107631 0.4892369
3   a    b 0.4859799 0.5140201
4   b    a 0.5090264 0.4909736
5   b    b 0.4545746 0.5454254
6   a    a 0.6211514 0.3788486

mnLogLoss(predicted, lev = levels(predicted$obs))
  logLoss 
0.6377392
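As an aside, the resampled logLoss from cross-validation can also be read straight off the train object, which is useful for comparing against the held-out value above (getTrainPerf and the results element are standard caret accessors):

# Resampled performance for the selected tuning parameters
getTrainPerf(gbmFit1)

# Full tuning grid, with logLoss for every parameter combination
gbmFit1$results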

Upvotes: 0
