Mesmer

Reputation: 351

R caret: how to choose the factor level order of the Class variable for prediction

I used the Sonar example from the caret documentation for the two-class sonar classification problem. The Sonar Class column is a factor with levels ordered as M and R. I changed the order of these levels to R and M and noticed that the predictions changed too. Here is my code:

library(mlbench)
library(caret)

data(Sonar)

set.seed(998)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using 
                           ## the following function
                           summaryFunction = twoClassSummary)

gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9),
                        n.trees = (1:30)*50,
                        shrinkage = 0.1,
                        n.minobsinnode = 20)

### original data set with Sonar$Class levels : c('M','R')
levels(Sonar$Class)
inTraining_MR <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training_MR <- Sonar[ inTraining_MR,]
testing_MR  <- Sonar[-inTraining_MR,]


set.seed(825)
gbmFit_MR <- train(Class ~ ., data = training_MR,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
gbmFit_MR

pred_MR = predict(gbmFit_MR, newdata = head(testing_MR))
prob_MR = predict(gbmFit_MR, newdata = head(testing_MR), type = "prob")
res_MR = data.frame(observed = head(testing_MR$Class), # observed values of the rows actually predicted
                  predicted = pred_MR,
                  probM = prob_MR$M,
                  probR = prob_MR$R)
res_MR


### modified data set with Sonar$Class levels : c('R','M')
Sonar$Class = factor(Sonar$Class, levels=c('R','M'))
levels(Sonar$Class)
set.seed(998)
inTraining_RM <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training_RM <- Sonar[ inTraining_RM,]
testing_RM  <- Sonar[-inTraining_RM,]

set.seed(825)
gbmFit_RM <- train(Class ~ ., data = training_RM,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
gbmFit_RM

pred_RM = predict(gbmFit_RM, newdata = head(testing_RM))
prob_RM = predict(gbmFit_RM, newdata = head(testing_RM), type = "prob")
res_RM = data.frame(observed = head(testing_RM$Class),
                  predicted = pred_RM,
                  probM = prob_RM$M,
                  probR = prob_RM$R)
res_RM

The prediction results:

> levels(Sonar$Class)
[1] "M" "R"
> res_MR
DataFrame with 6 rows and 4 columns
  observed predicted        probM        probR
  <factor>  <factor>    <numeric>    <numeric>
1        R         R 9.799645e-04 0.9990200355
2        R         R 1.825908e-04 0.9998174092
3        R         R 5.373401e-08 0.9999999463
4        R         R 1.693365e-03 0.9983066351
5        R         M 9.999348e-01 0.0000651877
6        R         M 9.862454e-01 0.0137546480

> levels(Sonar$Class)
[1] "R" "M"
> res_RM
DataFrame with 6 rows and 4 columns
  observed predicted       probM      probR
  <factor>  <factor>   <numeric>  <numeric>
1        R         R 0.091199794 0.90880021
2        R         R 0.080191807 0.91980819
3        R         R 0.005814888 0.99418511
4        R         R 0.395159792 0.60484021
5        R         R 0.009127547 0.99087245
6        R         M 0.966860393 0.03313961

As you can see, gbmFit_MR and gbmFit_RM produced different models, and therefore res_MR and res_RM contain different predictions, even though the same set.seed values were used.

I imagine that the order of the factor levels has an impact on model construction, since one of them is treated as the 'positive' or 'case' class (as in the pROC package), but I couldn't find where this is mentioned in the caret documentation.
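For what it's worth, caret's twoClassSummary (like pROC) treats the first factor level as the event of interest, so flipping the level order flips which class sensitivity, specificity, and the ROC direction refer to. A minimal sketch of this convention, using small made-up vectors rather than the fitted models above:

library(caret)

## Toy observed/predicted factors; M is the first level, hence the default 'positive' class.
obs  <- factor(c("M", "R", "M", "R"), levels = c("M", "R"))
pred <- factor(c("M", "R", "R", "R"), levels = c("M", "R"))

confusionMatrix(pred, obs)$positive                  # "M" (first level by default)
confusionMatrix(pred, obs, positive = "R")$positive  # "R" (made explicit)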

Upvotes: 0

Views: 1560

Answers (2)

Mesmer

Reputation: 351

Thanks for spending time on my question. I checked your link and tried to apply the seeds argument of the trainControl method:

library(mlbench)
library(caret)
library(doParallel)

data(Sonar)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

set.seed(998)
seeds <- vector(mode = "list", length = 101) # length = (n_repeats * n_folds) + 1 = (10 * 10) + 1
for(i in 1:100) seeds[[i]] <- sample.int(n = 1000, 3) # one seed per model fitted within each resample (here 3)
seeds[[101]] <- sample.int(1000, 1) # for the final model fit on the full training set

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           seeds = seeds,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using 
                           ## the following function
                           summaryFunction = twoClassSummary)

gbmGrid <-  expand.grid(interaction.depth = c(1, 5, 9),
                        n.trees = (1:30)*50,
                        shrinkage = 0.1,
                        n.minobsinnode = 20)

### original data set with Sonar$Class levels : c('M','R')
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
print(paste('training Class levels:', paste0(levels(training$Class),collapse = ' ')))
testing  <- Sonar[-inTraining,]
print(paste('testing Class levels:', paste0(levels(testing$Class),collapse = ' ')))

gbmFit_MR <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
gbmFit_MR

pred_MR = predict(gbmFit_MR, newdata = head(testing))
prob_MR = predict(gbmFit_MR, newdata = head(testing), type = "prob")
res_MR = data.frame(observed = head(testing$Class),
                  predicted = pred_MR,
                  probM = prob_MR$M,
                  probR = prob_MR$R)
res_MR


### modified training and test set with Sonar$Class levels : c('R','M')
training$Class = factor(training$Class, levels=c('R','M'))
print(paste('Modified training Class levels:', paste0(levels(training$Class),collapse = ' ')))
testing$Class = factor(testing$Class, levels=c('R','M'))
print(paste('Modified testing Class levels:', paste0(levels(testing$Class),collapse = ' ')))

gbmFit_RM <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
gbmFit_RM

pred_RM = predict(gbmFit_RM, newdata = head(testing))
prob_RM = predict(gbmFit_RM, newdata = head(testing), type = "prob")
res_RM = data.frame(observed = head(testing$Class),
                  predicted = pred_RM,
                  probM = prob_RM$M,
                  probR = prob_RM$R)
res_RM

all.equal(prob_MR, prob_RM)

But the discrepancy is still there:

> all.equal(prob_MR, prob_RM)
[1] "Names: 2 string mismatches"                      "Component 1: Mean relative difference: 1.991608"
[3] "Component 2: Mean relative difference: 1.844898"

> res_RM
  observed predicted        probM        probR
1        R         M 9.999865e-01 1.354416e-05
2        R         R 4.956433e-09 1.000000e+00
3        R         M 8.787160e-01 1.212840e-01
4        R         M 9.826566e-01 1.734338e-02
5        R         R 1.479808e-09 1.000000e+00
6        R         R 2.220446e-16 1.000000e+00

> res_MR
  observed predicted        probM       probR
1        R         M 9.966034e-01 0.003396588
2        R         R 1.028894e-04 0.999897111
3        R         M 9.403971e-01 0.059602914
4        R         M 9.481307e-01 0.051869320
5        R         R 4.457038e-05 0.999955430
6        R         R 1.740424e-08 0.999999983

The point here is to check the influence of the Class factor order: the only thing that changes between the two models is the order of the levels of the Class factor in the data frame, M then R in the first case, R then M in the second.
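Note that all.equal compares the two data frames column by column in position order, and the probability columns come back in level order (M, R for the first fit; R, M for the second), so the "Names: 2 string mismatches" and the cross-class comparison are expected artifacts. A fairer check, assuming the prob_MR and prob_RM objects from above, is to compare matching columns by name:

## Compare like with like: the probability assigned to each class in both fits.
all.equal(prob_MR$M, prob_RM$M)
all.equal(prob_MR$R, prob_RM$R)

The res tables above show real differences even per matched column, so the column order only explains part of the mismatch here.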

EDIT: I changed my answer to use the code from the link provided by geekoverdose, as I misinterpreted his response on my first read...

Upvotes: 0

geekoverdose

Reputation: 1007

You are looking at different samples for your two models.

As you are using different training and test partitions for your two models, you are comparing predictions on different samples from the test partition. To prevent this (and make your models comparable) you should use the same partitions: skip the second createDataPartition and reuse the same indexes, or use set.seed(...) with the same seed right before both calls. Note that createDataPartition stratifies its sampling by class, so with flipped levels even the same seed can yield a different split; reusing the indexes is the safer route (see the sketch below).
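A minimal sketch of the index-reuse approach, using the variable names from the question:

library(mlbench)
library(caret)
data(Sonar)

set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)

## Same rows for the c('M','R') ordering...
training_MR <- Sonar[ inTraining, ]
testing_MR  <- Sonar[-inTraining, ]

## ...and for the c('R','M') ordering: only the level order differs, not the rows.
Sonar$Class <- factor(Sonar$Class, levels = c("R", "M"))
training_RM <- Sonar[ inTraining, ]
testing_RM  <- Sonar[-inTraining, ]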

BTW: the other cross-validation results (overall model performance) should still be fairly similar.

Edit: you might further need to look into how to make caret model training itself fully reproducible (caret uses different seeds internally, e.g. with parallelization; see this question), which boils down to using the seeds parameter of trainControl.
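If parallel reproducibility is not required, one simpler alternative (a sketch, not from the original answer) is to disable parallelism and reset the RNG immediately before each train() call:

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                     classProbs = TRUE, summaryFunction = twoClassSummary,
                     allowParallel = FALSE)  # single-threaded: one RNG stream

set.seed(825)  # same seed right before each train() call
fit <- train(Class ~ ., data = training_MR, method = "gbm",
             trControl = ctrl, tuneGrid = gbmGrid,
             metric = "ROC", verbose = FALSE)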

Upvotes: 1
