Reputation: 95
I used a Random Forest Classifier in Python and MATLAB. With 10 trees in the ensemble, I got ~80% accuracy in Python and barely 30% in MATLAB. This difference persisted even when MATLAB's random forests were grown with 100 or 200 trees.
What could be the possible reason for this difference between these two programming languages?
The MATLAB code is below:
load 'path\to\feature vector'; % Observations X Features, loaded as segment_features
load 'path\to\targetValues'; % Observations X Target value, loaded as targets
% Set up Division of Data for Training, Validation, Testing
trainRatio = 70/100;
valRatio = 0/100;
testRatio = 30/100;
[trainInd,valInd,testInd] = dividerand(size(segment_features,1),trainRatio,...
valRatio,testRatio);
% Train the Forest
B = TreeBagger(10, segment_features(trainInd,:), targets(trainInd),...
    'OOBPrediction','On');
% Test the Forest
outputs_test = predict(B, segment_features(testInd,:));
outputs_test = str2num(cell2mat(outputs_test)); % predict returns labels as a cell array of char vectors
targets_test = targets(testInd,:);
Accuracy_test = sum(outputs_test == targets_test)/numel(testInd);
oobErrorBaggedEnsemble = oobError(B);
plot(oobErrorBaggedEnsemble)
xlabel 'Number of grown trees';
ylabel 'Out-of-bag classification error';
Upvotes: 2
Views: 2327
Reputation: 6599
The Problem
There are many reasons why the implementation of a random forest in two different programming languages (e.g., MATLAB and Python) will yield different results.
First of all, note that two random forests trained on the same data will never be identical by design: a random forest selects a random subset of candidate features at each split and grows each tree on a bootstrapped sample of the training data.
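To make that concrete, here is a small sketch on synthetic scikit-learn data (not your data): two forests that differ only in their random seed can already disagree on some held-out predictions.

```python
# Illustrative sketch on synthetic data: two forests differing only in
# their random seed may disagree on held-out predictions by design.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)
b = RandomForestClassifier(n_estimators=10, random_state=1).fit(X_tr, y_tr)

# Count test points where the two forests predict different classes.
n_disagree = int((a.predict(X_te) != b.predict(X_te)).sum())
print(n_disagree, "of", len(y_te), "test predictions differ")
```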
Second, different programming languages may have different default values set for the hyperparameters of a random forest (e.g., scikit-learn's random forest classifier uses gini as its default criterion to measure the quality of a split.)
Third, the size of the discrepancy will depend on the size of your data (which you do not specify in your question). Smaller datasets yield more variability in the structure of your random forests, so their output will differ more from one forest to another.
Finally, a single decision tree is susceptible to variability in the input data (slight data perturbations can yield a very different tree). Random forests aim for more stable and accurate predictions by growing many trees, but 10 (or even 100 or 200) trees are often not enough to get stable output.
Toward a Solution
I can recommend several strategies. First, ensure that the way the data are loaded into each program is equivalent. Is MATLAB misreading a critical variable differently from Python, causing the variable to become non-predictive (e.g., reading a numeric variable as a string variable)?
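A quick type check on the Python side can rule this out. This hypothetical illustration shows why a misread column is useless to a tree: a string-typed feature carries no numeric ordering, so threshold splits on it are meaningless.

```python
import numpy as np

# Hypothetical example: the same column of values, misread as strings
# vs. parsed as floats. Check dtype.kind after loading your real data.
raw = np.array(["3.14", "2.71", "1.41"])  # misread: unicode strings
parsed = raw.astype(float)                # intended: numeric values

print(raw.dtype.kind, parsed.dtype.kind)  # 'U' (string) vs 'f' (float)
```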
Second, once you are confident that your data are loaded identically in both programs, read the documentation of the random forest functions closely and ensure that you are specifying the same hyperparameters (e.g., the split criterion) in both. You want the random forests to be constructed as similarly as possible.
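For example, on the scikit-learn side you can pin the hyperparameters explicitly rather than relying on defaults; the TreeBagger equivalents noted in the comments are my reading of MATLAB's documentation, so double-check them against your MATLAB version.

```python
# Sketch: make scikit-learn's settings explicit so they can be matched
# against TreeBagger's (synthetic data used here for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=10,      # TreeBagger's NumTrees argument
    criterion="gini",     # TreeBagger: 'SplitCriterion','gdi' (Gini) is the default
    max_features="sqrt",  # TreeBagger: NumPredictorsToSample ~ sqrt(p) for classification
    min_samples_leaf=1,   # TreeBagger: MinLeafSize defaults to 1 for classification
    bootstrap=True,       # both implementations bootstrap by default
    random_state=0,
)
clf.fit(X, y)
print("training accuracy:", round(clf.score(X, y), 2))
```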
Third, it will likely be necessary to increase the number of trees to get more stable output from your forests. Ensure that the number of trees in both implementations is the same.
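One way to see whether your forests have stabilized is to measure how much accuracy varies across seeds at a given ensemble size; this is a sketch on synthetic data, not a benchmark of your problem.

```python
# Sketch: spread of test accuracy across random seeds, as a rough
# stability check; larger ensembles typically show a smaller spread.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy_spread(n_trees, seeds=range(5)):
    """Max minus min test accuracy over several differently-seeded forests."""
    accs = [RandomForestClassifier(n_estimators=n_trees, random_state=s)
            .fit(X_tr, y_tr).score(X_te, y_te) for s in seeds]
    return max(accs) - min(accs)

print("spread with 5 trees:  ", round(accuracy_spread(5), 3))
print("spread with 200 trees:", round(accuracy_spread(200), 3))
```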
Fourth, a potential difference between programs may come from how the data are split into training vs testing sets. It may be necessary to adopt a splitting rule that you can replicate exactly across your two programming languages (e.g., if you have a unique ID for each record, assign those with even numbers to training and those with odd numbers to testing).
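The even/odd-ID idea boils down to deriving the partition from a stable record attribute rather than from a language-specific random generator, so any language reproduces the same split. A minimal sketch with hypothetical IDs:

```python
# Sketch: a deterministic split from hypothetical record IDs 0..99.
# The same rule (ID mod 2) reproduces this partition in MATLAB too.
ids = list(range(100))
train_ids = [i for i in ids if i % 2 == 0]  # even IDs -> training
test_ids = [i for i in ids if i % 2 == 1]   # odd IDs -> testing

print(len(train_ids), "training /", len(test_ids), "testing")
```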
Finally, you may also benefit from creating multiple forests in each programming language and comparing the mean accuracy across iterations. This will give you a better sense of whether the difference in accuracy is truly reliable and significant or just a fluke.
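On the Python side that repetition might look like the following sketch (again with synthetic placeholder data); report the mean and standard deviation rather than a single run before concluding the two languages disagree.

```python
# Sketch: mean and spread of test accuracy over 10 differently-seeded
# forests, as a more reliable summary than one training run.
import statistics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

accs = [RandomForestClassifier(n_estimators=50, random_state=s)
        .fit(X_tr, y_tr).score(X_te, y_te) for s in range(10)]

print("mean accuracy:", round(statistics.mean(accs), 3),
      "std dev:", round(statistics.stdev(accs), 3))
```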
Good luck!
Upvotes: 2