Reputation: 477
Here is my output from running the train function:
Bagged CART
1251 samples
30 predictors
2 classes: 'N', 'Y'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 1247, 1247, 1247, 1247, 1247, 1247, ...
Resampling results
  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.806     0.572  0.0129       0.0263
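For context, output of this shape comes from a caret call along these lines (a simplified sketch; `dat` and `Class` are placeholder names, since the actual code isn't shown):

```r
library(caret)

# Hypothetical reconstruction: 'dat' is a data frame with a two-level
# factor outcome 'Class' (levels 'N', 'Y') and 30 predictor columns.
set.seed(123)
fit <- train(Class ~ ., data = dat,
             method = "treebag")  # bagged CART; caret defaults to 25 bootstrap reps
fit  # printing the fit gives the Accuracy / Kappa resampling summary
```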
Here is my confusionMatrix output:
Bootstrapped (25 reps) Confusion Matrix
(entries are percentages of table totals)
Reference
Prediction N Y
N 24.8 7.9
Y 11.5 55.8
After partitioning the data set (80% train / 20% test), I train the model, then call predict on my test partition and get ~65% accuracy.
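The partition-train-predict workflow described above looks roughly like this in caret (a sketch with placeholder names `dat` and `Class`; `createDataPartition` does a stratified split on the outcome):

```r
library(caret)

set.seed(123)
# 80/20 split, stratified on the outcome so the Y/N ratio is preserved
in_train <- createDataPartition(dat$Class, p = 0.8, list = FALSE)
training <- dat[in_train, ]
testing  <- dat[-in_train, ]

fit  <- train(Class ~ ., data = training, method = "treebag")
pred <- predict(fit, newdata = testing)

# Test-set accuracy to compare against the resampling estimate
confusionMatrix(pred, testing$Class)
```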
Questions:
(1) Does this mean my model is not very good?
(2) Is 'treebag' the proper method since I only have 2 classes ('N', 'Y')? Would logistic regression be better?
(3) Finally, my 1251 samples are roughly 67% 'Y' and 33% 'N'. Could this be "skewing" my training / results? Do I need a ratio closer to 50/50?
Any help would be greatly appreciated!!
Upvotes: 1
Views: 4271
Reputation: 14316
Code and a reproducible example would help here.
Assuming the confusion matrix came from running confusionMatrix.train, I would say that your model looks pretty good. The difference in accuracy is a little puzzling. I've regularly seen test set results look worse than the resampling results, but the bootstrap can be fairly pessimistic in measuring performance, and here it looks much better than the test set. Try a different training/test split and see if you get something similar (or try repeated 10-fold CV).
(1) Again, hard to say with what you have posted.
(2) That model is excellent, and there is no general rule about which model is better or worse (google the "no free lunch" theorem).
(3) That imbalance isn't too bad, so I don't think it is an issue (unless the training and test set percentages are different).
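A quick way to check that last point, assuming your partitions are named `training` and `testing` and the outcome column is `Class` (placeholder names):

```r
# Compare the N/Y proportions in the two partitions; they should be
# close if the split was stratified on the outcome
prop.table(table(training$Class))
prop.table(table(testing$Class))
```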
Max
Upvotes: 1