Reputation: 21
I'm building a random forest on some data from work (which means I can't share that data; there are 15k observations), using the caret train function for cross-validation. The accuracy of the model is very low: about 0.9%.
Here's the code I used:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
model <- train(ICNumber ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = TRUE, savePredictions = T, index = my_folds))
print(model$resample)
--Edit
As Gilles noticed, the fold indices are wrongly constructed and training is done on 20% of the observations. But even if I fix this by adding returnTrain = T, I'm still getting near-zero accuracy.
--Edit
model$resample produces this:
Accuracy     Kappa         Resample
0.026823683  0.0260175246  Fold1
0.002615234  0.0019433907  Fold2
0.002301118  0.0017644472  Fold3
0.001637733  0.0007026352  Fold4
0.010187315  0.0094986595  Fold5
Now if I do the cross validation by hand like this:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
for (fold in my_folds) {
  train_data <- my_data[-fold, ]
  test_data <- my_data[fold, ]
  model <- train(ICNumber ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$ICNumber, T, F)
  print(sum(e) / nrow(test_data))
}
I get the following accuracy:
[1] 0.743871
[1] 0.7566957
[1] 0.7380645
[1] 0.7390181
[1] 0.7311168
I was expecting to get about the same accuracy values. What am I doing wrong in train? Or is the manual prediction code wrong?
--Edit
Furthermore, this code works well on the Soybean data, and I can reproduce the results from Gilles below.
--Edit
--Edit2
Here are some details about my data:
15493 obs. of 17 variables:
ICNumber is a string with 1531 different values; these are the classes
the other 16 variables are factors with 33 levels
--Edit2
--Edit3
My last experiment was to drop the observations for all the classes occurring fewer than 10 times; 12k observations of 396 classes remained. For this dataset, the manual and automatic cross-validation accuracies match...
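For reference, the filtering step can be sketched like this, with a toy data frame standing in for the real work data (which can't be shared); ICNumber is the class column as in the question:

```r
# Toy stand-in for the real data: one frequent class, one rare class
my_data <- data.frame(ICNumber = c(rep("A", 12), rep("B", 3)),
                      x = runif(15))

# Keep only the classes occurring at least 10 times
keep <- names(which(table(my_data$ICNumber) >= 10))
my_data <- my_data[my_data$ICNumber %in% keep, ]

table(my_data$ICNumber)  # only the 12 "A" rows remain
```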
--Edit3
Upvotes: 2
Views: 4939
Reputation: 19716
To expand on the excellent answer by Gilles: apart from the mistake in specifying the indices used for testing and training, to get a fully reproducible model for algorithms that involve some stochastic process, like random forest, you should specify the seeds
argument in trainControl
. The length of this argument should equal the number of re-samples + 1 (for the final model):
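As an aside, ?trainControl documents a more general list form of seeds: a list of length B + 1, where the first B elements are integer vectors of length M (the number of tuning combinations being evaluated) and the last element is a single integer for the final model. With 5 folds and a single tuning combination, as in this answer, that list can be built like this (the plain vector rep(512, 6) used below is the shortcut for this single-combination case):

```r
B <- 5  # number of resamples (5 folds)
M <- 1  # number of tuning combinations in tuneGrid
# First B elements: one seed per tuning combination; last element: seed for the final model
seeds <- c(lapply(seq_len(B), function(i) rep(512, M)),
           list(512))
length(seeds)  # B + 1 = 6
```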
library(caret)
library(mlbench)
data(Sonar)
set.seed(512)
n <- nrow(Sonar)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5, returnTrain = T)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32),
                                     min.node.size = 1,
                                     splitrule = "gini"),
               data = Sonar,
               method = "ranger",
               trControl = trainControl(verboseIter = F,
                                        savePredictions = T,
                                        index = my_folds,
                                        seeds = rep(512, 6))) # this is the important part
model$resample
model$resample
#output
Accuracy Kappa Resample
1 0.8536585 0.6955446 Fold1
2 0.8095238 0.6190476 Fold2
3 0.8536585 0.6992665 Fold3
4 0.7317073 0.4786127 Fold4
5 0.8372093 0.6681367 Fold5
Now let's do the resampling manually:
for (fold in my_folds) {
  train_data <- Sonar[fold, ]
  test_data <- Sonar[-fold, ]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32),
                                       min.node.size = 1,
                                       splitrule = "gini"),
                 data = train_data,
                 method = "ranger",
                 trControl = trainControl(method = "none",
                                          seeds = 512)) # use the same seed as above
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#output
[1] 0.8536585
[1] 0.8095238
[1] 0.8536585
[1] 0.7317073
[1] 0.8372093
@semicolo If you can reproduce this example on the Sonar data set, but not with your own data, then the problem is in the data set, and any further insight will require investigating the data in question.
Upvotes: 2
Reputation: 21
It looks like the train function transforms the class column into a factor. In my dataset, a lot (about 20%) of the classes have fewer than 4 observations. When splitting the set by hand, the factor is constructed after the split, so every factor level has at least one observation. But during the automatic cross-validation, the factor is constructed on the full dataset, and after the splits are done, some levels of the factor have no observations. This seems to somehow mess up the accuracy. This probably calls for a new, different question. Thanks to Gilles and missuse for their help.
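The level mismatch can be illustrated with a small base-R sketch (toy class vector and a hypothetical held-out fold, not the real data):

```r
y <- c("a", "a", "b", "c")   # class column stored as strings, like ICNumber
fold <- c(3, 4)              # hypothetical held-out fold containing all "b" and "c" rows

# Manual CV: the factor is built after the split, so only observed classes become levels
manual <- factor(y[-fold])
levels(manual)               # "a"

# Automatic CV: the factor is built on the full data first, then split
automatic <- factor(y)[-fold]
levels(automatic)            # "a" "b" "c" -- levels "b" and "c" have no observations
```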
Upvotes: 0
Reputation: 4370
It was a tricky one ! ;-)
The error comes from a misuse of the index
option in trainControl
.
According to the help page, index
should be:
a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.
In your code you provided the integers corresponding to the rows that should be removed from the training dataset instead of providing the integers corresponding to the rows that should be used...
You can change that by using createFolds(train_indices, k=5, returnTrain = T)
instead of createFolds(train_indices, k=5)
.
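The difference between the two calls is easy to check on a toy vector (this sketch assumes caret is loaded):

```r
library(caret)

set.seed(1)
# Without returnTrain: each list element holds the held-out (test) rows of that fold
test_folds  <- createFolds(1:10, k = 5)
# With returnTrain = TRUE: each list element holds the complement, i.e. the training rows
train_folds <- createFolds(1:10, k = 5, returnTrain = TRUE)

lengths(test_folds)   # roughly n/k rows per fold
lengths(train_folds)  # roughly n - n/k rows per fold
```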
Note also that internally, afaik, caret
creates folds that are balanced relative
to the classes that you want to predict. So the code should ideally be more like createFolds(my_data[train_indices, "Class"], k=5, returnTrain = T)
, particularly
if the classes are not balanced...
Here is a reproducible example with the Soybean dataset
library(caret)
#> Le chargement a nécessité le package : lattice
#> Le chargement a nécessité le package : ggplot2
data(Soybean, package = "mlbench")
my_data <- droplevels(na.omit(Soybean))
Your code (the training data is here much smaller than expected; you use only 20% of the data, hence the lower accuracy).
You also get some warnings due to the absence of some classes in the training datasets (because of the class imbalance and the reduced training set).
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
#> Warning: Dropped unused factor level(s) in dependent variable: rhizoctonia-root-rot.
#> Warning: Dropped unused factor level(s) in dependent variable: downy-mildew.
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.7951002 0.7700909 Fold1
#> 2 0.5846868 0.5400131 Fold2
#> 3 0.8440980 0.8251373 Fold3
#> 4 0.8822222 0.8679453 Fold4
#> 5 0.8444444 0.8263563 Fold5
Corrected code, just with returnTrain = T
(here you really use 80% of the data for training...):
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5, returnTrain = T)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.9380531 0.9293371 Fold1
#> 2 0.8750000 0.8583687 Fold2
#> 3 0.9115044 0.9009814 Fold3
#> 4 0.8660714 0.8505205 Fold4
#> 5 0.9107143 0.9003825 Fold5
To be compared with your loop. There are still some small differences, so maybe there is still something that I don't understand.
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
for (fold in my_folds) {
  train_data <- my_data[-fold, ]
  test_data <- my_data[fold, ]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#> [1] 0.9380531
#> [1] 0.875
#> [1] 0.9115044
#> [1] 0.875
#> [1] 0.9196429
Created on 2018-03-09 by the reprex package (v0.2.0).
Upvotes: 4