m_squared

Reputation: 245

R - caret::train "random forest" parameters

I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() function from the caret package. I'm using the random forest method and am getting 0.999 Accuracy on my training set. However, when I use the model to predict, it classifies every test observation as the same class (i.e. each of the 20 observations is classified as a "1" out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting exact code), but I'm not sure what is happening.

My question: when I call the final model of fit (fit$finalModel), it says it made 500 total trees (default and expected), but the number of variables tried at each split is 35. I know that with classification, the standard number of variables tried at each split is the square root of the total number of predictors (so it should be sqrt(60) = 7.7, call it 8). Could this be the problem?

I'm confused about whether the problem is with my model, my data cleaning, or something else.

set.seed(10000)                                        # reproducibility
fitControl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)

fit$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
           Type of random forest: classification
                 Number of trees: 500
No. of variables tried at each split: 41

    OOB estimate of  error rate: 0.01%
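As a side note on the mtry question above: caret's default grid search picks mtry for you, but you can supply your own candidate values via the tuneGrid argument. A minimal sketch on the built-in iris data (your data frame and outcome column will differ):

```r
library(caret)
set.seed(10000)

fitControl <- trainControl(method = "cv", number = 5)

# For p predictors, randomForest's classification default is
# mtry = floor(sqrt(p)); a one-row grid pins it to a single value.
rfGrid <- expand.grid(mtry = 2)

fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = fitControl, tuneGrid = rfGrid)

fit$finalModel$mtry  # 2
```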

Upvotes: 0

Views: 1438

Answers (1)

Len Greski

Reputation: 10865

On the final project for the Johns Hopkins Practical Machine Learning course on Coursera, a random forest will generate the same prediction for all 20 quiz test cases if students fail to remove independent variables that have more than 50% NA values.

SOLUTION: remove variables that have a high proportion of missing values from the model.
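A minimal sketch of that cleaning step, shown on a toy data frame (your column names and threshold may differ): compute the fraction of NA values per column and keep only columns at or below 50%.

```r
# Toy data frame: one clean predictor, one mostly-NA predictor, the outcome.
df <- data.frame(
  good      = 1:10,
  mostly_na = c(1, rep(NA, 9)),
  y         = factor(rep(c("a", "b"), 5))
)

# Fraction of NA values in each column.
na_frac <- colMeans(is.na(df))

# Keep only columns with 50% or fewer NA values.
clean <- df[, na_frac <= 0.5]

names(clean)  # "good" "y"
```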

Upvotes: 0
