Reputation: 43
I apply the random forest algorithm in three different programming languages to the same pseudo sample dataset (1000 obs, binary 1/0 dependent variable, 10 numeric explanatory variables):
I also try to keep all model parameters identical across the three languages (number of trees, bootstrap sampling of the whole sample, number of variables randomly sampled as candidates at each split, and the criterion used to measure the quality of a split).
While Matlab and Python produce essentially the same results (i.e. probabilities), the R results are very different.
What could be the reason for the difference between the results produced by R on the one hand, and by Matlab & Python on the other?
I guess there's some default model parameter in R that differs and that I'm not aware of, or that is hard-coded in the underlying randomForest package.
The exact code I ran looks as follows:
Matlab:
b = TreeBagger(1000,X,Y, 'FBoot',1, 'NVarToSample',4, 'MinLeaf',1, 'Method', 'classification','Splitcriterion', 'gdi')
[~,scores,~] = predict(b,X);
Python:
clf = RandomForestClassifier(n_estimators=1000, max_features=4, bootstrap=True)
scores_fit = clf.fit(X, Y)
scores = pd.DataFrame(clf.predict_proba(X))
R:
results.rf <- randomForest(X,Y, ntree=1000, type = "classification", sampsize = length(Y),replace=TRUE,mtry=4)
scores <- predict(results.rf, type="prob",
norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
Upvotes: 4
Views: 1993
Reputation: 40628
When you call predict on a randomForest object in R without providing a dataset, it returns the out-of-bag (OOB) predictions. In your other two implementations, you are passing in the training data again. I suspect that if you do the same in the R version, your probabilities will be similar:
scores <- predict(results.rf, X, type="prob",
norm.votes=FALSE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)
Also note that if you want unbiased probabilities when predicting on the training data, R's approach of returning OOB predictions is the better one.
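To illustrate the OOB point on the Python side: scikit-learn can produce the same kind of out-of-bag estimate that R returns by default, via oob_score=True and the oob_decision_function_ attribute. A minimal sketch (the synthetic data and random_state are my assumptions, not from the original post) comparing in-sample probabilities with OOB ones:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Pseudo dataset roughly matching the question: 1000 obs, 10 numeric features,
# binary target (assumption: generated here for illustration).
X, Y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=1000, max_features=4,
                             bootstrap=True, oob_score=True, random_state=0)
clf.fit(X, Y)

# In-sample: every tree has seen most of these rows, so probabilities
# are pushed toward 0/1 (overfit).
in_sample = clf.predict_proba(X)

# OOB: each row is scored only by the trees that did NOT train on it,
# which is what R's predict() returns when newdata is omitted.
oob = clf.oob_decision_function_

print(in_sample[:3])
print(oob[:3])
```

With 1000 trees, every row is almost surely out-of-bag for some trees, so oob_decision_function_ contains no missing values; the OOB probabilities are visibly less extreme than the in-sample ones.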
Upvotes: 4