Reputation: 351
I used the Sonar example on the caret page with the two-class sonar classification. The Sonar Class column is a factor with levels ordered as M and R; I changed the order of these factor levels to R and M and noticed that the predictions changed too. Here is my code:
library(mlbench)
library(caret)
data(Sonar)
set.seed(998)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30) * 50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
### original data set with Sonar$Class levels : c('M','R')
levels(Sonar$Class)
inTraining_MR <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training_MR <- Sonar[ inTraining_MR,]
testing_MR <- Sonar[-inTraining_MR,]
set.seed(825)
gbmFit_MR <- train(Class ~ ., data = training_MR,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   ## Specify which metric to optimize
                   metric = "ROC")
gbmFit_MR
pred_MR = predict(gbmFit_MR, newdata = head(testing_MR))
prob_MR = predict(gbmFit_MR, newdata = head(testing_MR), type = "prob")
res_MR = data.frame(observed = head(Sonar$Class),
                    predicted = pred_MR,
                    probM = prob_MR$M,
                    probR = prob_MR$R)
res_MR
### modified data set with Sonar$Class levels : c('R','M')
Sonar$Class = factor(Sonar$Class, levels=c('R','M'))
levels(Sonar$Class)
set.seed(998)
inTraining_RM <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training_RM <- Sonar[ inTraining_RM,]
testing_RM <- Sonar[-inTraining_RM,]
set.seed(825)
gbmFit_RM <- train(Class ~ ., data = training_RM,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   ## Specify which metric to optimize
                   metric = "ROC")
gbmFit_RM
pred_RM = predict(gbmFit_RM, newdata = head(testing_RM))
prob_RM = predict(gbmFit_RM, newdata = head(testing_RM), type = "prob")
res_RM = data.frame(observed = head(Sonar$Class),
                    predicted = pred_RM,
                    probM = prob_RM$M,
                    probR = prob_RM$R)
res_RM
the predictions results:
>levels(Sonar$Class)
[1] "M" "R"
> res_MR
DataFrame with 6 rows and 4 columns
observed predicted probM probR
<factor> <factor> <numeric> <numeric>
1 R R 9.799645e-04 0.9990200355
2 R R 1.825908e-04 0.9998174092
3 R R 5.373401e-08 0.9999999463
4 R R 1.693365e-03 0.9983066351
5 R M 9.999348e-01 0.0000651877
6 R M 9.862454e-01 0.0137546480
> levels(Sonar$Class)
[1] "R" "M"
> res_RM
DataFrame with 6 rows and 4 columns
observed predicted probM probR
<factor> <factor> <numeric> <numeric>
1 R R 0.091199794 0.90880021
2 R R 0.080191807 0.91980819
3 R R 0.005814888 0.99418511
4 R R 0.395159792 0.60484021
5 R R 0.009127547 0.99087245
6 R M 0.966860393 0.03313961
As you can see, gbmFit_MR and gbmFit_RM produced different models, therefore res_MR and res_RM contain different predictions although they use the same set.seed values.
I imagine that the order of the factor levels has an impact on the model construction, as one of them is taken to be the 'positive' or 'case' class (as in the pROC package), but I couldn't find where this is mentioned in the caret documentation.
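For reference, caret's twoClassSummary (and confusionMatrix) treat the first factor level as the event / positive class by default, so reordering the levels changes which class the ROC-based summary is computed for. A minimal sketch of inspecting and controlling the level order (purely illustrative, not part of the original code):
library(mlbench)
library(caret)
data(Sonar)
levels(Sonar$Class)                            # "M" "R": "M" is treated as the event class
Sonar$Class <- relevel(Sonar$Class, ref = "R") # make "R" the first level instead
levels(Sonar$Class)                            # "R" "M": now "R" is the event class
# confusionMatrix() exposes the same convention through its 'positive' argument,
# which defaults to the first level, e.g. confusionMatrix(pred, obs, positive = "R")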
Upvotes: 0
Views: 1560
Reputation: 351
Thanks for spending time on my question. I checked your link and tried to apply the seeds argument in the trainControl method:
library(mlbench)
library(caret)
library(doParallel)
data(Sonar)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(998)
seeds <- vector(mode = "list", length = 101) # length = (n_repeats * n_resamples) + 1
for(i in 1:100) seeds[[i]] <- sample.int(n = 1000, 3)
seeds[[101]] <- sample.int(1000, 1) # for the last model
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           seeds = seeds,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30) * 50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
### original data set with Sonar$Class levels : c('M','R')
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
print(paste('training Class levels:', paste0(levels(training$Class),collapse = ' ')))
testing <- Sonar[-inTraining,]
print(paste('testing Class levels:', paste0(levels(testing$Class),collapse = ' ')))
gbmFit_MR <- train(Class ~ ., data = training,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   ## Specify which metric to optimize
                   metric = "ROC")
gbmFit_MR
pred_MR = predict(gbmFit_MR, newdata = head(testing))
prob_MR = predict(gbmFit_MR, newdata = head(testing), type = "prob")
res_MR = data.frame(observed = head(Sonar$Class),
                    predicted = pred_MR,
                    probM = prob_MR$M,
                    probR = prob_MR$R)
res_MR
### modified training and test set with Sonar$Class levels : c('R','M')
training$Class = factor(training$Class, levels=c('R','M'))
print(paste('Modified training Class levels:', paste0(levels(training$Class),collapse = ' ')))
testing$Class = factor(testing$Class, levels=c('R','M'))
print(paste('Modified testing Class levels:', paste0(levels(testing$Class),collapse = ' ')))
gbmFit_RM <- train(Class ~ ., data = training,
                   method = "gbm",
                   trControl = fitControl,
                   verbose = FALSE,
                   tuneGrid = gbmGrid,
                   ## Specify which metric to optimize
                   metric = "ROC")
gbmFit_RM
pred_RM = predict(gbmFit_RM, newdata = head(testing))
prob_RM = predict(gbmFit_RM, newdata = head(testing), type = "prob")
res_RM = data.frame(observed = head(Sonar$Class),
                    predicted = pred_RM,
                    probM = prob_RM$M,
                    probR = prob_RM$R)
res_RM
all.equal(prob_MR, prob_RM)
But the discrepancy is still there:
> all.equal(prob_MR, prob_RM)
[1] "Names: 2 string mismatches" "Component 1: Mean relative difference: 1.991608"
[3] "Component 2: Mean relative difference: 1.844898"
> res_RM
observed predicted probM probR
1 R M 9.999865e-01 1.354416e-05
2 R R 4.956433e-09 1.000000e+00
3 R M 8.787160e-01 1.212840e-01
4 R M 9.826566e-01 1.734338e-02
5 R R 1.479808e-09 1.000000e+00
6 R R 2.220446e-16 1.000000e+00
> res_MR
observed predicted probM probR
1 R M 9.966034e-01 0.003396588
2 R R 1.028894e-04 0.999897111
3 R M 9.403971e-01 0.059602914
4 R M 9.481307e-01 0.051869320
5 R R 4.457038e-05 0.999955430
6 R R 1.740424e-08 0.999999983
The point here is to check the influence of the Class factor order: the only thing that changes between the two models is the order of the levels of the factor Class in the Sonar data frame, M then R in the first case, R then M in the second.
EDIT: I changed my answer using the code from the link provided by geekoverdose, as I had misinterpreted his response on my first read.
Upvotes: 0
Reputation: 1007
You are looking at different samples for your two models.
As you are using different training and test partitions for your two models, you are also comparing different samples from the test partition. To prevent this (and make your models comparable) you should use the same partitions: skip the second createDataPartition
call and reuse the same indexes, or call set.seed(...)
with the same seed right before each partition.
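For example, a minimal sketch of a comparable setup (object names are illustrative; fitControl and gbmGrid as defined in the question): reuse one partition for both fits and change only the factor level order afterwards:
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining, ]
testing  <- Sonar[-inTraining, ]
## first fit: levels c("M", "R")
set.seed(825)
fit_MR <- train(Class ~ ., data = training, method = "gbm",
                trControl = fitControl, tuneGrid = gbmGrid,
                metric = "ROC", verbose = FALSE)
## second fit: identical rows, only the level order differs
training_RM <- training
training_RM$Class <- factor(training_RM$Class, levels = c("R", "M"))
set.seed(825)
fit_RM <- train(Class ~ ., data = training_RM, method = "gbm",
                trControl = fitControl, tuneGrid = gbmGrid,
                metric = "ROC", verbose = FALSE)
## class probabilities are now computed on the same test rows
prob_MR <- predict(fit_MR, newdata = testing, type = "prob")
prob_RM <- predict(fit_RM, newdata = testing, type = "prob")
all.equal(prob_MR$M, prob_RM$M)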
BTW: the other cross validation results (overall model performance) should still be fairly similar.
Edit: you might further need to look into how to make caret model training itself fully reproducible (it uses different seeds internally, e.g. with parallelization, see this question), which boils down to using the seeds
parameter of trainControl.
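Besides the seeds list, another way to pin down the resampling itself is to pre-compute the fold indices with createMultiFolds and pass them through trainControl's index argument, so that both fits see exactly the same folds. A sketch, assuming the 10 x 10 repeated CV from the question (cvIndex is an illustrative name):
set.seed(998)
cvIndex <- createMultiFolds(training$Class, k = 10, times = 10)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           index = cvIndex,   # identical resampling folds for every fit
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)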
Upvotes: 1