Markkk

Reputation: 43

Random Forest discrepancy between R and Matlab & Python

I apply the random forest algorithm in three different programming languages to the same pseudo-sample dataset (1000 observations, binary 0/1 dependent variable, 10 numeric explanatory variables):

  1. Matlab 2015a (same results with 2012a) using the "TreeBagger" function (part of the Statistics and Machine Learning Toolbox)
  2. R using the "randomForest" package: https://cran.r-project.org/web/packages/randomForest/index.html
  3. Python using the "RandomForestClassifier" from sklearn.ensemble: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

I also try to keep all model parameters identical across programming languages (no. of trees, bootstrap sampling of the whole sample, no. of variables randomly sampled as candidates at each split, criterion to measure the quality of a split).

While Matlab and Python produce basically the same results (i.e. predicted probabilities), the R results are very different.

What could be the reason for the difference between the results produced by R on the one hand, and by Matlab & Python on the other?

I guess there's some default model parameter in R that I'm not aware of, or that is hard-coded in the underlying randomForest package, which differs from the other two implementations.

The exact code I ran is as follows:

Matlab:

 % 1000 trees, full bootstrap, 4 candidate variables per split, Gini criterion
 b = TreeBagger(1000, X, Y, 'FBoot', 1, 'NVarToSample', 4, 'MinLeaf', 1, ...
     'Method', 'classification', 'Splitcriterion', 'gdi');
 [~, scores, ~] = predict(b, X);  % in-sample class probabilities

Python:

 from sklearn.ensemble import RandomForestClassifier
 import pandas as pd

 clf = RandomForestClassifier(n_estimators=1000, max_features=4, bootstrap=True)
 clf.fit(X, Y)
 scores = pd.DataFrame(clf.predict_proba(X))  # in-sample class probabilities

R:

 library(randomForest)

 # Y must be a factor; classification is then inferred, so no 'type' argument is needed
 results.rf <- randomForest(X, Y, ntree = 1000, sampsize = length(Y), replace = TRUE, mtry = 4)
 scores <- predict(results.rf, type = "prob",
     norm.votes = FALSE, predict.all = FALSE, proximity = FALSE, nodes = FALSE)

Upvotes: 4

Views: 1993

Answers (1)

Zelazny7

Reputation: 40628

When you call predict on a randomForest object in R without providing a dataset, it returns the out-of-bag predictions. In your other methods, you are passing in the training data again. I suspect that if you do this in the R version, your probabilities will be similar:

 scores <- predict(results.rf, X, type="prob",
    norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
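
With the training matrix supplied as newdata, all three implementations score the same rows in-sample, so the fitted probabilities should line up closely (apart from the Monte Carlo noise that bootstrap sampling and random feature selection introduce).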

Also note that if you want unbiased probabilities, R's approach of returning the OOB predictions is the better one when predicting on the training data.
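
For reference, sklearn can produce the same kind of out-of-bag probabilities via oob_score=True and the oob_decision_function_ attribute. A minimal sketch, using simulated data shaped like the question's dataset (the variable names and the data-generating process here are placeholders, not the asker's actual data):

 import numpy as np
 from sklearn.ensemble import RandomForestClassifier

 # pseudo data matching the question's setup: 1000 obs, 10 numeric predictors, 0/1 response
 rng = np.random.RandomState(0)
 X = rng.randn(1000, 10)
 Y = (X[:, 0] + rng.randn(1000) > 0).astype(int)

 # oob_score=True makes the forest accumulate out-of-bag votes during fitting
 clf = RandomForestClassifier(n_estimators=1000, max_features=4,
                              bootstrap=True, oob_score=True)
 clf.fit(X, Y)

 # OOB class probabilities, one row per training sample -- the sklearn
 # analogue of R's predict(results.rf, type = "prob") with no newdata
 oob_scores = clf.oob_decision_function_

Comparing oob_scores against R's default predict() output is then the apples-to-apples comparison.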

Upvotes: 4
