Oscar

Reputation: 2102

Random forest performed below expectations

I am learning the decision tree algorithm and implemented a random forest, using the RF in Weka as a reference. I tested both my implementation and the Weka implementation (under default settings) with the same dataset. However, the accuracy of mine was about 5% lower than the accuracy obtained by Weka version 3.8 (using 'train-first1000.arff' as the training set and 'dev-first1000.arff' as the test set).

The ARFF dataset I used was movie reviews from IMDb. Each instance contains the frequencies of the most frequent words and is labeled with either 'P' (for positive) or 'N' (for negative).
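For illustration, a data file in that layout might look roughly like this (the attribute names and values are made up; only the structure of word frequencies plus a 'P'/'N' class label matters):

@relation movie-reviews

@attribute freq_good numeric
@attribute freq_bad numeric
@attribute freq_great numeric
@attribute class {P,N}

@data
3,0,2,P
0,4,1,N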

For comparison, I used the default settings of Weka's random forest (100 trees, log(num features) + 1 features considered at each split, no bagging, etc.).
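For reference, here is a minimal C++ sketch of that default subset size and the per-split feature sampling, assuming the rule is floor(log2(numPredictors)) + 1; the function names are illustrative and not taken from my repository:

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Number of candidate features per split: floor(log2(numPredictors)) + 1,
// which is my reading of Weka's default when numFeatures is left at 0.
unsigned int DefaultNumFeatures( unsigned int numPredictors )
{
    return static_cast<unsigned int>( std::log2( numPredictors ) ) + 1;
}

// Draw a random subset of feature indices for one split.
// Using the forest-wide generator keeps the results reproducible per seed.
std::vector<unsigned int> SampleFeatures( unsigned int numPredictors,
                                          unsigned int numSelected,
                                          std::mt19937& rng )
{
    std::vector<unsigned int> indices( numPredictors );
    for (unsigned int i = 0; i < numPredictors; ++i)
        indices[i] = i;

    std::shuffle( indices.begin(), indices.end(), rng );
    indices.resize( numSelected );
    return indices;
}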

Here are the Weka results with default settings and my results with the same settings (100 trees, 10 random features considered at each split):

[screenshot: Weka results and my results]

At first I thought there were bugs in my data importer. I checked it, compared it with Python ARFF importers, and found that it was working correctly.

Then I checked the source code of the Weka RF: http://grepcode.com/file/repo1.maven.org/maven2/nz.ac.waikato.cms.weka/weka-dev/3.7.5/weka/classifiers/trees/RandomForest.java

and compared it with mine carefully, more than once, to make sure the implementations were the same. Yet I was not able to find a reason for the remaining 5% difference.

Here is the link to my implementation:

https://github.com/YSZhuoyang/Parallelized-Random-Forests/tree/randomforests

More specifically, the main part of the training algorithm can be found in "TreeBuilder.cpp" and "TreeBuilder.h".

Update:

I tested the 10-feature data and the 50-feature data separately; the results obtained by my implementation were all lower than Weka's.

10 features (100 trees, 4 features considered at each split): [screenshot: 10 features, 100 trees]

50 features (100 trees, 6 features considered at each split): [screenshot: 50 features, 100 trees]

To organize the results and make the variation due to randomization easier to judge, I grouped them into the following table (50 features in total, 6 random features considered at each split, both random seeds set to 1):

------------------------------------------------------------------------
| num total features | num trees | weka accuracy (%) | my accuracy (%) |
|         50         |     1     |       55.61       |      52.34      |
|         50         |     5     |       59.08       |      54.35      |
|         50         |    10     |       60.07       |      55.43      |
|         50         |    20     |       62.54       |      57.20      |
|         50         |    50     |       64.14       |      59.56      |
|         50         |    100    |       65.28       |      61.09      |
------------------------------------------------------------------------

which showed that the difference was not due to randomization.

Solved:

Following the suggestion given by @alexeykuzmin0, I used the diabetes dataset provided by Weka, which has only 8 features, and tested it with a random tree in Weka, considering all features at each split. Then I visualized the tree, compared it with my tree, and found that the split point selected at the root node was different from mine, which suggested that my calculation of information gain was wrong. Finally, I figured out that there was a type error that cast a double value to an int, which led to the inaccurate results.

The relevant piece of code:

// Compute the weighted entropy of the children
for (const vector<unsigned int>& group : groups)
{
    double entropyChild = ComputeEntropy( group );

    // Before the fix: both operands are unsigned ints, so the integer
    // division numChildren / numInstances truncates to 0 whenever the
    // child is smaller than the parent, and the child entropy is lost.
    // unsigned int numChildren = group.size();
    // infoGain -= numChildren / numInstances * entropyChild;

    // After the fix: the ratio is computed in double precision.
    double numChildren = group.size();
    infoGain -= numChildren / (double) numInstances * entropyChild;
}
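For completeness, here is a self-contained sketch of the whole information-gain computation with all ratios done in double precision; the names are simplified stand-ins and not the exact code in "TreeBuilder.cpp":

#include <cmath>
#include <vector>

using std::vector;

// Entropy of one node, given the per-class instance counts in it.
double ComputeNodeEntropy( const vector<unsigned int>& classCounts,
                           unsigned int numInstances )
{
    double entropy = 0.0;
    for (unsigned int count : classCounts)
    {
        if (count == 0) continue;
        double p = static_cast<double>( count ) / numInstances;
        entropy -= p * std::log2( p );
    }
    return entropy;
}

// Information gain of a candidate split: parent entropy minus the
// size-weighted entropy of each child. If the weight were computed with
// integer division, it would truncate to 0 for every child smaller than
// the parent, which is exactly the bug described above.
double ComputeInfoGain( double parentEntropy,
                        const vector<vector<unsigned int>>& childClassCounts,
                        unsigned int numInstances )
{
    double infoGain = parentEntropy;
    for (const vector<unsigned int>& counts : childClassCounts)
    {
        unsigned int childSize = 0;
        for (unsigned int c : counts)
            childSize += c;
        if (childSize == 0) continue;

        double weight = static_cast<double>( childSize ) / numInstances;
        infoGain -= weight * ComputeNodeEntropy( counts, childSize );
    }
    return infoGain;
}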

Here is an updated comparison between my implementation and Weka:

------------------------------------------------------------------------
| num total features | num trees | weka accuracy (%) | my accuracy (%) |
|         50         |     1     |       55.61       |      55.34      |
|         50         |     5     |       59.08       |      58.73      |
|         50         |    10     |       60.07       |      60.86      |
|         50         |    20     |       62.54       |      62.97      |
|         50         |    50     |       64.14       |      64.68      |
|         50         |    100    |       65.28       |      65.35      |
------------------------------------------------------------------------

Thanks for all the answers and help.

Upvotes: 1

Views: 773

Answers (1)

user3666197

Reputation: 1

Two implementations of the (almost) same learning algorithm will in principle differ in many aspects. That alone introduces a principal difference: the same dataset will receive slightly different results from otherwise similar high-level machine-learning processes.

Even if the implementation is exactly the same, the results are not guaranteed to be identical. The best case to show this is the Random Forest class of machine-learning learners:

Random Forests are Breiman's forests of highly randomised trees.

Notice the word Randomised.

As noted in the comment above, even setting aside all other implementation differences, the very nature of the Random Forest requires randomisation -- as a re-used tool -- to take place in the processing.

Rigorous research requires repeatable experiments, and thus Random Forest learners typically allow an explicit seed value to be provided for the underlying RNG ( random number generator ).

Testing the same RandomForest .fit() method on the same dataset, but with a different seed value provided, yields a certain amount of variance in the resulting (im)precision across otherwise identical repetitions of the process. Providing the same seed value ought to produce the same results ( if and only if no other host-co-located process damages the interim state of the RNG used inside the Random Forest generation phase ).
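As a concrete illustration ( a C++ sketch with illustrative names, not Weka's actual Java code ), making the whole ensemble deterministic usually boils down to deriving every per-tree generator from one explicitly provided seed:

#include <cstdint>
#include <random>
#include <vector>

// One explicit forest seed makes the whole ensemble reproducible: each tree
// gets its own generator derived deterministically from that seed, so the
// same ( learner, dataset, seed ) triple grows the same forest every time.
std::vector<std::mt19937> MakeTreeRngs( std::uint32_t forestSeed,
                                        unsigned int numTrees )
{
    std::mt19937 master( forestSeed );
    std::vector<std::mt19937> treeRngs;
    treeRngs.reserve( numTrees );
    for (unsigned int i = 0; i < numTrees; ++i)
        treeRngs.emplace_back( master() );   // per-tree seed drawn from the master RNG
    return treeRngs;
}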


The good news is that the amount of this variance decreases from some >10% as the ensemble grows ( the more trees, the lower the RNG-introduced variance ).

So a 4% ~ 6% difference is quite within the reach of this sort of phenomenon for an RF.

[ 3-D surface plot: predictive performance vs. number of trees and RNG seed ]

The Z-axis shows the increasing predictive performance of the Random Forest ensemble, together with the decreasing amount of variance ( introduced by the different RNG( seed ) ) superimposed on the predictions.

The X-axis ( running to the right ) carries N_trees.

The Y-axis ( running to the left ) carries the seed ordinal number ~ 100 different explicit seed values ( starting from aBASE, aBASE+1, +2, +3, ..., +99 ).


Epilogue

Given the OP, you can re-test your own implementation for RNG-seed dependence by re-running your RF learner on the same dataset with different seeds, to receive a quantitative assessment of this phenomenon on your [ Your-Learner, DataSET, ( Your-RNG, seed ) ] (now deterministic) system.
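A sketch of such a seed sweep ( TrainAndEvaluate-style wrapper passed in as trainAndEvaluate is a hypothetical stand-in for whatever wraps your learner; it takes a seed and returns test-set accuracy in percent ):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Re-run the same learner on the same dataset with many different seeds
// and report the spread of the resulting accuracies.
void SeedSweep( const std::function<double( std::uint32_t )>& trainAndEvaluate,
                std::uint32_t baseSeed, unsigned int numRuns )
{
    if (numRuns == 0) return;

    std::vector<double> accuracies;
    accuracies.reserve( numRuns );
    for (unsigned int i = 0; i < numRuns; ++i)
        accuracies.push_back( trainAndEvaluate( baseSeed + i ) );

    double sum = 0.0, minAcc = accuracies[0], maxAcc = accuracies[0];
    for (double a : accuracies)
    {
        sum += a;
        minAcc = std::min( minAcc, a );
        maxAcc = std::max( maxAcc, a );
    }
    double mean = sum / numRuns;

    double var = 0.0;
    for (double a : accuracies)
        var += ( a - mean ) * ( a - mean );

    std::printf( "mean %.2f%%  std %.2f  min %.2f  max %.2f  over %u seeds\n",
                 mean, std::sqrt( var / numRuns ), minAcc, maxAcc, numRuns );
}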

If you can perform the same test on the [ Weka-Learner, DataSET, ( Weka-RNG, seed ) ] complex, you will get an even sharper view of this landscape.

Upvotes: 1
