Reputation: 21
I wrote the following Python code to run the RandomForestClassifier on the Forest CoverType dataset from the UCI ML repo (using default parameter settings). The results are very poor, however, with an accuracy of around 60%, while this technique should be able to reach over 90% (e.g. with Weka). I already tried increasing n_estimators to 100, but that didn't result in much improvement.
Any ideas on what I could do to get better results with this technique in scikit-learn, or what could be the reason for this poor performance?
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score  # formerly sklearn.cross_validation

covtype = fetch_covtype()
clf = RandomForestClassifier()  # default parameter settings
# cv=3 matches the three fold scores shown in the output
scores = cross_val_score(clf, covtype.data, covtype.target, cv=3)
print(scores)

Output:
[ 0.5483831 0.58210057 0.61055001]
Upvotes: 2
Views: 3218
Reputation: 5355
I managed to get a good improvement on your model by using GridSearchCV:

from sklearn import metrics
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

covtype = fetch_covtype()
clf = RandomForestClassifier()

# Hold out a third of the data; train_test_split shuffles the rows by default
X_train, X_test, y_train, y_test = train_test_split(covtype.data,
                                                    covtype.target,
                                                    test_size=0.33,
                                                    random_state=42)

params = {'n_estimators': [30, 50, 100],
          'max_features': ['sqrt', 'log2', 10]}

# Plain 'f1' is binary-only in current scikit-learn; 'f1_weighted' is a multiclass variant
gsv = GridSearchCV(clf, params, cv=3,
                   n_jobs=-1, scoring='f1_weighted')
gsv.fit(X_train, y_train)

# Per-class metrics on the training split, then on the held-out test split
print(metrics.classification_report(y_train, gsv.best_estimator_.predict(X_train)))
print(metrics.classification_report(y_test, gsv.best_estimator_.predict(X_test)))
Outputs:

Training set:
             precision    recall  f1-score   support
          1       1.00      1.00      1.00    141862
          2       1.00      1.00      1.00    189778
          3       1.00      1.00      1.00     24058
          4       1.00      1.00      1.00      1872
          5       1.00      1.00      1.00      6268
          6       1.00      1.00      1.00     11605
          7       1.00      1.00      1.00     13835
avg / total       1.00      1.00      1.00    389278

Test set:
             precision    recall  f1-score   support
          1       0.97      0.95      0.96     69978
          2       0.95      0.97      0.96     93523
          3       0.95      0.96      0.95     11696
          4       0.92      0.86      0.89       875
          5       0.94      0.78      0.86      3225
          6       0.94      0.90      0.92      5762
          7       0.97      0.95      0.96      6675
avg / total       0.96      0.96      0.96    191734
This isn't too far off the scores on the Kaggle leaderboard (note, though, that the Kaggle competition uses a much more challenging data split).
If you want to see more improvements then you will have to consider the uneven classes and how best to select your training data.
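As a rough sketch of what handling the uneven classes could look like: stratified, shuffled folds keep the class proportions similar in every fold, and class_weight='balanced' reweights the rarer classes. The data here is a synthetic, imbalanced stand-in generated with make_classification (an assumption made so the example runs quickly); the sample sizes, weights and the f1_macro metric are illustrative choices, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for covtype (assumption: replace X, y with
# fetch_covtype().data and fetch_covtype().target for the real dataset).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)

# class_weight='balanced' weights samples inversely to their class frequency.
clf = RandomForestClassifier(n_estimators=50, class_weight='balanced',
                             random_state=0)

# Shuffled, stratified folds: each fold keeps roughly the same class mix.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Macro-averaged F1 gives every class the same weight in the score.
scores = cross_val_score(clf, X, y, cv=cv, scoring='f1_macro')
print(scores, scores.mean())
```

Shuffling the folds may also matter on the real data if its rows are not in random order, since unshuffled folds would then see differently distributed samples.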
NOTE
I used a smaller number of estimators than I typically would, to save time. However, the model performed well on the training set, so you may not need more.
I used a small max_features, as this typically reduces variance (overfitting) by decorrelating the trees, though this isn't always true.
I used f1 scoring as I don't know the dataset well, and f1 tends to work quite well on classification problems.
Upvotes: 1
Reputation: 397
Did you get 90% with the same dataset and the same estimator? Because the dataset is split into:
first 11,340 records used for training data subset
next 3,780 records used for validation data subset
last 565,892 records used for testing data subset
and the documentation claims the following performance, which makes your untuned random forest look not so poor:
70% Neural Network (backpropagation)
58% Linear Discriminant Analysis
As for n_estimators equal to 100, you can increase it to 500, 1,000 or even more. Check the results for each value and keep the number at which the score starts to stabilize.
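One way to watch the score stabilize, sketched under assumptions: grow a single forest incrementally with warm_start=True and record the out-of-bag accuracy after each step. The synthetic data from make_classification and the particular tree counts are placeholders for illustration; on covtype you would pass the real features and labels and probably larger values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: replace with the covtype arrays).
X, y = make_classification(n_samples=1500, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# warm_start=True keeps the trees already built and only adds new ones on
# each fit; oob_score=True estimates accuracy from out-of-bag samples.
clf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)

oob_scores = {}
for n in [50, 100, 200, 400]:
    clf.set_params(n_estimators=n)
    clf.fit(X, y)
    oob_scores[n] = clf.oob_score_
    print(n, round(clf.oob_score_, 4))
```

When the printed score stops changing meaningfully between steps, adding more trees mostly costs training time.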
The problem might come from the default hyperparameters of Weka compared to the scikit-learn ones. You can tune some of them to improve your results:
max_features: the number of features to split on at each tree node.
max_depth: the model may overfit your training data a bit by growing too deep.
min_samples_split, min_samples_leaf, min_weight_fraction_leaf and max_leaf_nodes: these deal with the distribution of the samples among the leaves (when to keep them, or not).
You may also try to work on your features by combining them, or by reducing the dimension.
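A minimal sketch of tuning several of these hyperparameters at once with RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all. The candidate values, n_iter and the synthetic data are assumptions picked to keep the example fast, not recommendations for covtype.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: replace with the covtype arrays).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Illustrative candidate values for the hyperparameters listed above.
param_distributions = {
    'max_features': ['sqrt', 'log2', 0.5],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'max_leaf_nodes': [None, 100, 500],
}

# Sample 8 random combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions, n_iter=8, cv=3, scoring='accuracy', random_state=0)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```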
You should have a look at Kaggle scripts such as here, where they describe how to get 78% with ExtraTreesClassifier (however, the training set contains the 11,340 + 3,780 records, and they seem to use a higher number of n_estimators).
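To compare against the documented figures, you would train on the first rows only and test on the rest. The helper below is a hypothetical sketch (score_ordered_split is not a scikit-learn function), and it is run on synthetic data here. For covtype you would call it with n_train=15120 (11,340 + 3,780), assuming fetch_covtype() returns the rows in the original file order, which I believe is the case when shuffle is left at its default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier


def score_ordered_split(X, y, n_train, n_estimators=200):
    """Fit on the first n_train rows and return accuracy on the remaining rows."""
    clf = ExtraTreesClassifier(n_estimators=n_estimators, random_state=0)
    clf.fit(X[:n_train], y[:n_train])
    return clf.score(X[n_train:], y[n_train:])


# Synthetic stand-in data (assumption: replace with the covtype arrays and
# n_train=15120 to mimic the documented train + validation subsets).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
accuracy = score_ordered_split(X, y, n_train=500)
print(accuracy)
```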
Upvotes: 0
Reputation: 2039
You can try the following to improve your score:
Train your model on all the attributes available to you. It will overfit, but it will give you an idea of how much accuracy you can reach on the training set.
Next, drop the least important features by using clf.feature_importances_
Use GridSearchCV to tune the hyperparameters of your model. Use cross-validation and oob_score (out-of-bag score) to get a better estimate of generalization.
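The first two steps might look roughly like this: fit on all features, drop the less important ones, and compare out-of-bag scores before and after. The median threshold is an arbitrary choice for illustration, and the data is a synthetic stand-in rather than covtype.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: replace with the covtype arrays).
X, y = make_classification(n_samples=1500, n_features=20, n_informative=6,
                           n_redundant=2, random_state=0)

# Step 1: train on every feature and keep the out-of-bag accuracy.
full = RandomForestClassifier(n_estimators=100, oob_score=True,
                              random_state=0).fit(X, y)

# Step 2: keep features at or above the median importance
# (arbitrary illustrative threshold), then retrain on that subset.
importances = full.feature_importances_
keep = importances >= np.median(importances)
reduced = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 random_state=0).fit(X[:, keep], y)

print('all features:    ', round(full.oob_score_, 4))
print('reduced features:', round(reduced.oob_score_, 4))
```

If the reduced model scores about the same, the dropped features were probably not contributing much.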
Upvotes: 0