Boom

Reputation: 1315

Why does random forest always give a prediction score of 1.0?

I'm trying to test the prediction score of the following classifiers:

- random forest
- k neighbors
- svm
- naïve bayes

I'm not using feature selection or feature scaling (no preprocessing at all).

I'm using a train-test split as follows:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

I tested several datasets (from sklearn):

- load_iris
- load_breast_cancer
- load_wine

On all three of them, random forest always gave a perfect prediction (test accuracy of 1.0).
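For reference, a minimal version of my setup looks roughly like this (default hyperparameters everywhere, since I didn't tune anything; GaussianNB standing in for naive bayes):

from sklearn.datasets import load_iris, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "random forest": RandomForestClassifier(),
    "k neighbors": KNeighborsClassifier(),
    "svm": SVC(),
    "naive bayes": GaussianNB(),
}

# Fit and score every classifier on every dataset, with no preprocessing
for loader in (load_iris, load_breast_cancer, load_wine):
    X, y = loader(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    for name, clf in classifiers.items():
        acc = clf.fit(X_train, y_train).score(X_test, y_test)
        print(f"{loader.__name__}: {name} test accuracy = {acc:.3f}")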

I tried to create random samples for classification:

from sklearn.datasets import make_classification

X, y = make_classification(flip_y=0.3, weights=[0.65, 0.35], n_features=40, n_redundant=4, n_informative=36, n_classes=2, n_clusters_per_class=1, n_samples=50000)

and again random forest gave perfect prediction on the test set (accuracy 1.0).
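The scoring was done the same way as above, roughly (same 70/30 split, default random forest):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the synthetic data, then fit and score a default random forest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # this is what keeps coming out as 1.0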

All the other classifiers performed well on the test set (0.8-0.97), but none was perfect (1.0) like random forest.

Upvotes: 0

Views: 2847

Answers (1)

desertnaut

Reputation: 60318

Regarding the perfect accuracy score of 1.0, keep in mind that all three of these datasets are nowadays considered toy ones, and the same probably holds for the artificial data generated by scikit-learn's make_classification.

That said, it is true that RF is considered a very powerful classification algorithm. There is even a relatively recent (2014) paper, titled Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?, which concluded (quoting from the abstract, emphasis in the original):

We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest-neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods) [...] We use 121 data sets, which represent the whole UCI data base [...] The classifiers most likely to be the bests are the random forest (RF) versions

Although the paper has drawn some criticism, mainly (but not only) because it did not include boosted trees (see also Are Random Forests Truly the Best Classifiers?), the truth is that, at least in the area of "traditional", pre-deep learning classification, there was already a saying, when in doubt, try RF, which the paper mentioned above came to reinforce.

Upvotes: 4
