Reputation: 21
I'm building a random forest on some data from work (which means I can't share that data; there are 15k observations), using the caret train function for cross-validation. The accuracy of the model is very low: about 0.9%.
Here's the code I used:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
model <- train(ICNumber ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = TRUE, savePredictions = T, index = my_folds))
print(model$resample)
--Edit
As Gilles noticed, the fold indices are wrongly constructed and training is done on 20% of the observations. But even if I fix this by adding returnTrain = T, I'm still getting near-zero accuracy.
--Edit
model$resample produces this:
Accuracy     Kappa         Resample
0.026823683  0.0260175246  Fold1
0.002615234  0.0019433907  Fold2
0.002301118  0.0017644472  Fold3
0.001637733  0.0007026352  Fold4
0.010187315  0.0094986595  Fold5
Now if I do the cross validation by hand like this:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
for (fold in my_folds) {
  train_data <- my_data[-fold, ]
  test_data <- my_data[fold, ]
  model <- train(ICNumber ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$ICNumber, T, F)
  print(sum(e) / nrow(test_data))
}
I get the following accuracy:
[1] 0.743871
[1] 0.7566957
[1] 0.7380645
[1] 0.7390181
[1] 0.7311168
I was expecting to get about the same accuracy values. What am I doing wrong in train? Or is the manual prediction code wrong?
--Edit
Furthermore, this code works well on the Soybean data, and I can reproduce the results from Gilles below.
--Edit
--Edit2
Here are some details about my data:
15493 obs. of 17 variables:
ICNumber is a string with 1531 different values; these are the classes
the other 16 variables are factors with 33 levels
--Edit2
--Edit3
My last experiment was to drop the observations for all the classes occurring fewer than 10 times; 12k observations of 396 classes remained. For this dataset, the manual and automatic cross-validation accuracies match...
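For reference, the filtering step can be sketched like this, with a toy data frame standing in for the real work data (which can't be shared); ICNumber is the class column as in the question:

```r
# Toy stand-in for the real data: one frequent class, one rare class
my_data <- data.frame(ICNumber = c(rep("A", 12), rep("B", 3)),
                      x = runif(15))

# Keep only the classes occurring at least 10 times
keep <- names(which(table(my_data$ICNumber) >= 10))
my_data <- my_data[my_data$ICNumber %in% keep, ]

table(my_data$ICNumber)  # only the 12 "A" rows remain
```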
--Edit3
Upvotes: 2
Views: 4939
Reputation: 19716
To expand on the excellent answer by Gilles: apart from the mistake in specifying the indices used for testing and training, to get a fully reproducible model for algorithms that involve some stochastic process, like random forest, you should specify the seeds
argument in trainControl
. The length of this argument should equal the number of re-samples + 1 (for the final model):
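As an aside, ?trainControl documents a more general list form of seeds: a list of length B + 1, where the first B elements are integer vectors of length M (the number of tuning combinations being evaluated) and the last element is a single integer for the final model. With 5 folds and a single tuning combination, as in this answer, that list can be built like this (the plain vector rep(512, 6) used below is the shortcut for this single-combination case):

```r
B <- 5  # number of resamples (5 folds)
M <- 1  # number of tuning combinations in tuneGrid
# First B elements: one seed per tuning combination; last element: seed for the final model
seeds <- c(lapply(seq_len(B), function(i) rep(512, M)),
           list(512))
length(seeds)  # B + 1 = 6
```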
library(caret)
library(mlbench)
data(Sonar)
set.seed(512)
n <- nrow(Sonar)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5, returnTrain = T)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32),
                                     min.node.size = 1,
                                     splitrule = "gini"),
               data = Sonar,
               method = "ranger",
               trControl = trainControl(verboseIter = F,
                                        savePredictions = T,
                                        index = my_folds,
                                        seeds = rep(512, 6))) # this is the important part
model$resample
model$resample
#output
Accuracy Kappa Resample
1 0.8536585 0.6955446 Fold1
2 0.8095238 0.6190476 Fold2
3 0.8536585 0.6992665 Fold3
4 0.7317073 0.4786127 Fold4
5 0.8372093 0.6681367 Fold5
Now let's do the resampling manually:
for (fold in my_folds) {
  train_data <- Sonar[fold, ]
  test_data <- Sonar[-fold, ]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32),
                                       min.node.size = 1,
                                       splitrule = "gini"),
                 data = train_data,
                 method = "ranger",
                 trControl = trainControl(method = "none",
                                          seeds = 512)) # use the same seed as above
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#output
[1] 0.8536585
[1] 0.8095238
[1] 0.8536585
[1] 0.7317073
[1] 0.8372093
@semicolo If you can reproduce this example on the Sonar data set, but not with your own data, then the problem is in the data set, and any further insight will require investigating the data in question.
Upvotes: 2
Reputation: 21
It looks like the train function transforms the class column into a factor. In my dataset, a lot (about 20%) of the classes have fewer than 4 observations. When splitting the set by hand, the factor is constructed after the split, so every factor level has at least one observation. But during the automatic cross-validation, the factor is constructed on the full dataset, and after the splits are done, some levels of the factor have no observations. This seems to somehow mess up the accuracy. This probably calls for a new, different question. Thanks to Gilles and missuse for their help.
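The level mismatch can be illustrated with a small base-R sketch (toy class vector and a hypothetical held-out fold, not the real data):

```r
y <- c("a", "a", "b", "c")   # class column stored as strings, like ICNumber
fold <- c(3, 4)              # hypothetical held-out fold containing all "b" and "c" rows

# Manual CV: the factor is built after the split, so only observed classes become levels
manual <- factor(y[-fold])
levels(manual)               # "a"

# Automatic CV: the factor is built on the full data first, then split
automatic <- factor(y)[-fold]
levels(automatic)            # "a" "b" "c" -- levels "b" and "c" have no observations
```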
Upvotes: 0
Reputation: 4370
It was a tricky one ! ;-)
The error comes from a misuse of the index
option in trainControl
.
According to the help page, index
should be:
a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.
In your code you provided the integers corresponding to the rows that should be removed from the training dataset instead of providing the integers corresponding to the rows that should be used...
You can change that by using createFolds(train_indices, k=5, returnTrain = T)
instead of createFolds(train_indices, k=5)
.
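The difference between the two calls is easy to check on a toy vector (this sketch assumes caret is loaded):

```r
library(caret)

set.seed(1)
# Without returnTrain: each list element holds the held-out (test) rows of that fold
test_folds  <- createFolds(1:10, k = 5)
# With returnTrain = TRUE: each list element holds the complement, i.e. the training rows
train_folds <- createFolds(1:10, k = 5, returnTrain = TRUE)

lengths(test_folds)   # roughly n/k rows per fold
lengths(train_folds)  # roughly n - n/k rows per fold
```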
Note also that internally, afaik, caret
creates folds that are balanced relative
to the classes that you want to predict. So the code should ideally be more like createFolds(my_data[train_indices, "Class"], k=5, returnTrain = T)
, particularly
if the classes are not balanced...
Here is a reproducible example with the Soybean dataset
library(caret)
#> Le chargement a nécessité le package : lattice
#> Le chargement a nécessité le package : ggplot2
data(Soybean, package = "mlbench")
my_data <- droplevels(na.omit(Soybean))
Your code (the training data is here much smaller than expected; you use only 20% of the data, hence the lower accuracy).
You also get some warnings due to the absence of some classes in the training datasets (because of the class imbalance and the reduced training set).
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
#> Warning: Dropped unused factor level(s) in dependent variable: rhizoctonia-root-rot.
#> Warning: Dropped unused factor level(s) in dependent variable: downy-mildew.
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.7951002 0.7700909 Fold1
#> 2 0.5846868 0.5400131 Fold2
#> 3 0.8440980 0.8251373 Fold3
#> 4 0.8822222 0.8679453 Fold4
#> 5 0.8444444 0.8263563 Fold5
Corrected code, just with returnTrain = T
(here you really use 80% of the data for training...):
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5, returnTrain = T)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.9380531 0.9293371 Fold1
#> 2 0.8750000 0.8583687 Fold2
#> 3 0.9115044 0.9009814 Fold3
#> 4 0.8660714 0.8505205 Fold4
#> 5 0.9107143 0.9003825 Fold5
To be compared with your loop. There are still some small differences, so maybe there is still something that I don't understand.
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
for (fold in my_folds) {
  train_data <- my_data[-fold, ]
  test_data <- my_data[fold, ]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#> [1] 0.9380531
#> [1] 0.875
#> [1] 0.9115044
#> [1] 0.875
#> [1] 0.9196429
Created on 2018-03-09 by the reprex package (v0.2.0).
Upvotes: 4