Oscar

Reputation: 2102

Random forest performed below expectations

I am learning the decision tree algorithm and implemented a random forest, using the RF in Weka as a reference. I tested both my implementation and the Weka implementation (under default settings) with the same dataset. However, the accuracy of mine was about 5% lower than the accuracy obtained by Weka version 3.8 (using 'train-first1000.arff' as the training set and 'dev-first1000.arff' as the test set).

The ARFF dataset I used was movie reviews from IMDb. Each instance contains the frequencies of the most frequent words and is labeled with either 'P' (for positive) or 'N' (for negative).
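For illustration, a data file in that layout might look roughly like this (the attribute names and values are made up; only the structure of word frequencies plus a 'P'/'N' class label matters):

@relation movie-reviews

@attribute freq_good numeric
@attribute freq_bad numeric
@attribute freq_great numeric
@attribute class {P,N}

@data
3,0,2,P
0,4,1,N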

For comparison, I used the default settings of Weka's random forest (100 trees, log(num features) + 1 features considered at each split, no bagging, etc.).
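For reference, here is a minimal C++ sketch of that default subset size and the per-split feature sampling, assuming the rule is floor(log2(numPredictors)) + 1; the function names are illustrative and not taken from my repository:

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Number of candidate features per split: floor(log2(numPredictors)) + 1,
// which is my reading of Weka's default when numFeatures is left at 0.
unsigned int DefaultNumFeatures( unsigned int numPredictors )
{
    return static_cast<unsigned int>( std::log2( numPredictors ) ) + 1;
}

// Draw a random subset of feature indices for one split.
// Using the forest-wide generator keeps the results reproducible per seed.
std::vector<unsigned int> SampleFeatures( unsigned int numPredictors,
                                          unsigned int numSelected,
                                          std::mt19937& rng )
{
    std::vector<unsigned int> indices( numPredictors );
    for (unsigned int i = 0; i < numPredictors; ++i)
        indices[i] = i;

    std::shuffle( indices.begin(), indices.end(), rng );
    indices.resize( numSelected );
    return indices;
}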

Here are the Weka results with default settings and my results with the same settings (100 trees, 10 random features considered at each split):

[screenshot: Weka results and my results]

At first I thought there were bugs in my data importer. I checked it, compared it with Python ARFF importers, and found that it was working correctly.

Then I checked the source code of the Weka RF: http://grepcode.com/file/repo1.maven.org/maven2/nz.ac.waikato.cms.weka/weka-dev/3.7.5/weka/classifiers/trees/RandomForest.java

and compared it with mine carefully, more than once, to make sure the implementations were the same. Yet I was not able to find a reason for the remaining 5% difference.

Here is the link to my implementation:

https://github.com/YSZhuoyang/Parallelized-Random-Forests/tree/randomforests

More specifically, the main part of the training algorithm can be found in "TreeBuilder.cpp" and "TreeBuilder.h".

Update:

I tested the 10-feature data and the 50-feature data separately; the results obtained by my implementation were all lower than Weka's.

10 features (100 trees, 4 features considered at each split): [screenshot: 10 features, 100 trees]

50 features (100 trees, 6 features considered at each split): [screenshot: 50 features, 100 trees]

To organize the results and make the variation due to randomization easier to judge, I grouped them into the following table (50 features in total, 6 random features considered at each split, both random seeds set to 1):

------------------------------------------------------------------------
| num total features | num trees | weka accuracy (%) | my accuracy (%) |
|         50         |     1     |       55.61       |      52.34      |
|         50         |     5     |       59.08       |      54.35      |
|         50         |    10     |       60.07       |      55.43      |
|         50         |    20     |       62.54       |      57.20      |
|         50         |    50     |       64.14       |      59.56      |
|         50         |    100    |       65.28       |      61.09      |
------------------------------------------------------------------------

which showed that the difference was not due to randomization.

Solved:

Following the suggestion given by @alexeykuzmin0, I used the diabetes dataset provided by Weka, which has only 8 features, and tested it with a random tree in Weka, considering all features at each split. Then I visualized the tree, compared it with my tree, and found that the split point selected at the root node was different from mine, which suggested that my calculation of information gain was wrong. Finally, I figured out that there was a type error that cast a double value to an int, which led to the inaccurate results.

The relevant piece of code:

// Compute the weighted entropy of the children
for (const vector<unsigned int>& group : groups)
{
    double entropyChild = ComputeEntropy( group );

    // Before the fix: both operands are unsigned ints, so the integer
    // division numChildren / numInstances truncates to 0 whenever the
    // child is smaller than the parent, and the child entropy is lost.
    // unsigned int numChildren = group.size();
    // infoGain -= numChildren / numInstances * entropyChild;

    // After the fix: the ratio is computed in double precision.
    double numChildren = group.size();
    infoGain -= numChildren / (double) numInstances * entropyChild;
}
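For completeness, here is a self-contained sketch of the whole information-gain computation with all ratios done in double precision; the names are simplified stand-ins and not the exact code in "TreeBuilder.cpp":

#include <cmath>
#include <vector>

using std::vector;

// Entropy of one node, given the per-class instance counts in it.
double ComputeNodeEntropy( const vector<unsigned int>& classCounts,
                           unsigned int numInstances )
{
    double entropy = 0.0;
    for (unsigned int count : classCounts)
    {
        if (count == 0) continue;
        double p = static_cast<double>( count ) / numInstances;
        entropy -= p * std::log2( p );
    }
    return entropy;
}

// Information gain of a candidate split: parent entropy minus the
// size-weighted entropy of each child. If the weight were computed with
// integer division, it would truncate to 0 for every child smaller than
// the parent, which is exactly the bug described above.
double ComputeInfoGain( double parentEntropy,
                        const vector<vector<unsigned int>>& childClassCounts,
                        unsigned int numInstances )
{
    double infoGain = parentEntropy;
    for (const vector<unsigned int>& counts : childClassCounts)
    {
        unsigned int childSize = 0;
        for (unsigned int c : counts)
            childSize += c;
        if (childSize == 0) continue;

        double weight = static_cast<double>( childSize ) / numInstances;
        infoGain -= weight * ComputeNodeEntropy( counts, childSize );
    }
    return infoGain;
}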

Here is an updated comparison between my implementation and Weka:

------------------------------------------------------------------------
| num total features | num trees | weka accuracy (%) | my accuracy (%) |
|         50         |     1     |       55.61       |      55.34      |
|         50         |     5     |       59.08       |      58.73      |
|         50         |    10     |       60.07       |      60.86      |
|         50         |    20     |       62.54       |      62.97      |
|         50         |    50     |       64.14       |      64.68      |
|         50         |    100    |       65.28       |      65.35      |
------------------------------------------------------------------------

Thanks for all the answers and help.

Upvotes: 1

Views: 773

Answers (1)

user3666197

Reputation: 1

Two implementations of the (almost) same learning algorithm will in principle differ in many aspects. That alone introduces a principal difference: the same dataset will receive slightly different results from otherwise similar high-level machine-learning processes.

Even if the implementation is exactly the same, the results are not guaranteed to be identical. The best case to show this is the Random Forest class of machine-learning learners:

Random Forests are Breiman's forests of highly randomised trees.

Notice the word Randomised.

As noted in the comment above, even setting aside all other implementation differences, the very nature of the Random Forest requires randomisation -- as a re-used tool -- to take place in the processing.

Rigorous research requires repeatable experiments, and thus Random Forest learners typically allow an explicit seed value to be provided for the underlying RNG ( random number generator ).

Testing the same RandomForest .fit() method on the same dataset, but with a different seed value provided, yields a certain amount of variance in the resulting (im)precision across otherwise identical repetitions of the process. Providing the same seed value ought to produce the same results ( if and only if no other host-co-located process damages the interim state of the RNG used inside the Random Forest generation phase ).
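As a concrete illustration ( a C++ sketch with illustrative names, not Weka's actual Java code ), making the whole ensemble deterministic usually boils down to deriving every per-tree generator from one explicitly provided seed:

#include <cstdint>
#include <random>
#include <vector>

// One explicit forest seed makes the whole ensemble reproducible: each tree
// gets its own generator derived deterministically from that seed, so the
// same ( learner, dataset, seed ) triple grows the same forest every time.
std::vector<std::mt19937> MakeTreeRngs( std::uint32_t forestSeed,
                                        unsigned int numTrees )
{
    std::mt19937 master( forestSeed );
    std::vector<std::mt19937> treeRngs;
    treeRngs.reserve( numTrees );
    for (unsigned int i = 0; i < numTrees; ++i)
        treeRngs.emplace_back( master() );   // per-tree seed drawn from the master RNG
    return treeRngs;
}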


The good news is that the amount of this variance decreases from some >10% as the ensemble grows ( the more trees, the lower the RNG-introduced variance ).

So a 4% ~ 6% difference is quite within the reach of this sort of phenomenon for an RF.

[ 3-D surface plot: predictive performance vs. number of trees and RNG seed ]

The Z-axis shows the increasing predictive performance of the Random Forest ensemble, together with the decreasing amount of variance ( introduced by the different RNG( seed ) ) superimposed on the predictions.

The X-axis ( running to the right ) carries N_trees.

The Y-axis ( running to the left ) carries the seed ordinal number ~ 100 different explicit seed values ( starting from aBASE, aBASE+1, +2, +3, ..., +99 ).


Epilogue

Given the OP, you can re-test your own implementation for RNG-seed dependence by re-running your RF learner on the same dataset with different seeds, to receive a quantitative assessment of this phenomenon on your [ Your-Learner, DataSET, ( Your-RNG, seed ) ] (now deterministic) system.
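A sketch of such a seed sweep ( TrainAndEvaluate-style wrapper passed in as trainAndEvaluate is a hypothetical stand-in for whatever wraps your learner; it takes a seed and returns test-set accuracy in percent ):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Re-run the same learner on the same dataset with many different seeds
// and report the spread of the resulting accuracies.
void SeedSweep( const std::function<double( std::uint32_t )>& trainAndEvaluate,
                std::uint32_t baseSeed, unsigned int numRuns )
{
    if (numRuns == 0) return;

    std::vector<double> accuracies;
    accuracies.reserve( numRuns );
    for (unsigned int i = 0; i < numRuns; ++i)
        accuracies.push_back( trainAndEvaluate( baseSeed + i ) );

    double sum = 0.0, minAcc = accuracies[0], maxAcc = accuracies[0];
    for (double a : accuracies)
    {
        sum += a;
        minAcc = std::min( minAcc, a );
        maxAcc = std::max( maxAcc, a );
    }
    double mean = sum / numRuns;

    double var = 0.0;
    for (double a : accuracies)
        var += ( a - mean ) * ( a - mean );

    std::printf( "mean %.2f%%  std %.2f  min %.2f  max %.2f  over %u seeds\n",
                 mean, std::sqrt( var / numRuns ), minAcc, maxAcc, numRuns );
}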

If you can perform the same test on the [ Weka-Learner, DataSET, ( Weka-RNG, seed ) ] complex, you will get an even sharper view of this landscape.

Upvotes: 1
