Bart Goethals

Reputation: 21

Poor performance of RandomForestClassifier

I wrote the following Python code to run RandomForestClassifier on the Forest CoverType dataset from the UCI ML repository (using default parameter settings). The results are very poor, however, with an accuracy around 60%, while this technique should be able to reach over 90% (e.g. with Weka). I already tried increasing n_estimators to 100, but that didn't result in much improvement.

Any ideas on what I could do to get better results with this technique in scikit-learn, or what could be the reason for this poor performance?

    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation


    covtype = fetch_covtype()
    clf = RandomForestClassifier()
    scores = cross_validation.cross_val_score(clf, covtype.data, covtype.target)
    print scores

[ 0.5483831   0.58210057  0.61055001] 

Upvotes: 2

Views: 3218

Answers (3)

piman314

Reputation: 5355

I managed to get a good improvement on your model by using GridSearchCV:

from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn import grid_search
from sklearn import metrics


covtype = fetch_covtype()
clf = RandomForestClassifier()

# Hold out a third of the data for evaluation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(covtype.data,
                                                                     covtype.target,
                                                                     test_size=0.33,
                                                                     random_state=42)
params = {'n_estimators':[30, 50, 100],
          'max_features':['sqrt', 'log2', 10]}
# Note: with a multiclass target, recent scikit-learn versions require an
# explicit average, e.g. scoring='f1_weighted'
gsv = grid_search.GridSearchCV(clf, params, cv=3,
                               n_jobs=-1, scoring='f1')
gsv.fit(X_train, y_train)

print metrics.classification_report(y_train, gsv.best_estimator_.predict(X_train))

print metrics.classification_report(y_test, gsv.best_estimator_.predict(X_test))

Outputs:

         precision    recall  f1-score   support

          1       1.00      1.00      1.00    141862
          2       1.00      1.00      1.00    189778
          3       1.00      1.00      1.00     24058
          4       1.00      1.00      1.00      1872
          5       1.00      1.00      1.00      6268
          6       1.00      1.00      1.00     11605
          7       1.00      1.00      1.00     13835

avg / total       1.00      1.00      1.00    389278

             precision    recall  f1-score   support

          1       0.97      0.95      0.96     69978
          2       0.95      0.97      0.96     93523
          3       0.95      0.96      0.95     11696
          4       0.92      0.86      0.89       875
          5       0.94      0.78      0.86      3225
          6       0.94      0.90      0.92      5762
          7       0.97      0.95      0.96      6675

avg / total       0.96      0.96      0.96    191734

This isn't too far off the scores on the Kaggle leaderboard (note, though, that the Kaggle competition uses a much more challenging data split).

If you want to see further improvement, you will have to consider the uneven classes and how best to select your training data.
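For instance, one simple starting point for the uneven classes is the class_weight option; this is only a sketch, and the parameter values are illustrative rather than tuned:

    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation

    covtype = fetch_covtype()

    # class_weight='balanced' reweights each class inversely to its frequency,
    # so the rare cover types carry more weight when the trees choose splits.
    clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                 n_jobs=-1, random_state=42)

    # With a classifier and an integer cv, cross_val_score uses stratified folds,
    # which keeps the class proportions comparable across folds.
    scores = cross_validation.cross_val_score(clf, covtype.data, covtype.target, cv=3)
    print scores.mean()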

NOTE

I used fewer estimators than I typically would in order to save time; however, the model performed well on the training set, so you may not need to revisit this.

I used a small value for max_features, as this typically decorrelates the trees and reduces overfitting, though it isn't always an improvement.

I used f1 scoring because I don't know the dataset well, and f1 tends to work well on classification problems.

Upvotes: 1

Igor OA

Reputation: 397

Did you get 90% with the same dataset and the same estimator? Keep in mind that the dataset is split into the following subsets (this split is reproduced in the sketch below):

first 11,340 records used for training data subset

next 3,780 records used for validation data subset

last 565,892 records used for testing data subset

The dataset documentation reports the following performance for the original study, which makes your untuned random forest look not so poor:

70% Neural Network (backpropagation)

58% Linear Discriminant Analysis
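If you want to compare against those published numbers, the original split can be reproduced by simple slicing, under the assumption that fetch_covtype returns the records in the original file order:

    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import metrics

    covtype = fetch_covtype()
    X, y = covtype.data, covtype.target

    # Original UCI split: first 11,340 records for training,
    # next 3,780 for validation, last 565,892 for testing.
    X_train, y_train = X[:11340], y[:11340]
    X_val, y_val = X[11340:15120], y[11340:15120]
    X_test, y_test = X[15120:], y[15120:]

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf.fit(X_train, y_train)
    print metrics.accuracy_score(y_test, clf.predict(X_test))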

As for n_estimators, 100 is only a start: you can increase it to 500, 1,000 or even more. Check the results for each value and keep the number at which the score starts to stabilize.
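A minimal sketch of that sweep (each fit on the full dataset is slow, and the grid of values is only an example):

    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation

    covtype = fetch_covtype()

    # Increase n_estimators until the cross-validated score stops improving.
    for n in [100, 200, 500, 1000]:
        clf = RandomForestClassifier(n_estimators=n, n_jobs=-1, random_state=42)
        scores = cross_validation.cross_val_score(clf, covtype.data, covtype.target, cv=3)
        print n, scores.mean()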

The difference might also come from Weka's default hyperparameters compared to scikit-learn's. You can tune some of them to improve your results (a small grid over these is sketched after the list):

  • max_features: the number of features considered when splitting at each tree node.
  • max_depth: the model may be overfitting your training data if the trees grow too deep.
  • min_samples_split, min_samples_leaf, min_weight_fraction_leaf and max_leaf_nodes: these control how the samples are distributed among the leaves, i.e. when to keep splitting and when to stop.
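Here is a small sketch of such a grid, using a train/test split like in the answer above; the candidate values are guesses for illustration, not recommendations:

    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation, grid_search

    covtype = fetch_covtype()
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        covtype.data, covtype.target, test_size=0.33, random_state=42)

    # Illustrative grid over a few of the parameters listed above.
    params = {'max_features': ['sqrt', 'log2'],
              'max_depth': [15, 30, None],
              'min_samples_leaf': [1, 5, 10]}
    gsv = grid_search.GridSearchCV(RandomForestClassifier(n_estimators=100),
                                   params, cv=3, n_jobs=-1)
    gsv.fit(X_train, y_train)
    print gsv.best_params_
    print gsv.score(X_test, y_test)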

You may also try to work on your features, either by combining them or by reducing the dimensionality.
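For example, a sketch of dimensionality reduction with PCA inside a pipeline; the choice of 30 components is arbitrary, and with the one-hot soil type columns it may or may not help:

    from sklearn.datasets import fetch_covtype
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn import cross_validation

    covtype = fetch_covtype()

    # Project the 54 original features onto 30 principal components
    # before fitting the forest.
    pipe = Pipeline([('pca', PCA(n_components=30)),
                     ('rf', RandomForestClassifier(n_estimators=100, n_jobs=-1))])
    scores = cross_validation.cross_val_score(pipe, covtype.data, covtype.target, cv=3)
    print scores.mean()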

You should also have a look at Kaggle scripts such as here, where they describe how to get 78% with ExtraTreesClassifier (note, however, that their training set contains the 11,340 + 3,780 records, and they seem to use a higher n_estimators).

Upvotes: 0

Abhishek Sharma

Reputation: 2039

You can try the following to improve your score:

  1. Train your model on all the attributes available to you. It will overfit, but it will give you an idea of how much accuracy you can reach on the training set.

  2. Next, drop the least important features using clf.feature_importances_ (see the sketch after this list).

  3. Use GridSearchCV to tune the hyperparameters for your model. Use cross-validation and oob_score (the out-of-bag score) to get a better estimate of generalization.
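A rough sketch of steps 2 and 3, using oob_score for a cheap generalization estimate; the importance threshold below (the mean importance) is an arbitrary choice for illustration:

    from sklearn.datasets import fetch_covtype
    from sklearn.ensemble import RandomForestClassifier

    covtype = fetch_covtype()
    X, y = covtype.data, covtype.target

    # oob_score=True gives an out-of-bag estimate of generalization
    # without needing a separate validation split.
    clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 n_jobs=-1, random_state=42)
    clf.fit(X, y)
    print clf.oob_score_

    # Drop the least important features, e.g. everything below the mean importance.
    importances = clf.feature_importances_
    mask = importances > importances.mean()
    X_reduced = X[:, mask]
    print X_reduced.shape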

Upvotes: 0
