R. Psmith

Reputation: 91

Random Forest Libraries: Different Results in R and Python

The code below trains a random forest model in R and in Python. As you can see, the accuracy is better in R (1 - 0.27 = 0.73) than in Python (0.69). Furthermore, the feature importances differ between R and Python.

[EDIT] Is there any way to replicate the R results in Python, or are there things that are simply out of our control? Some of the tunable parameters differ between the two libraries, which makes it hard to match them exactly.

Does anybody else get different results from Python's and R's random forests? What are the differences?

R code:

library(randomForest)
mydata=read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$admit=factor(mydata$admit)
rf = randomForest(admit~gre+gpa+rank, mydata, ntree=1000, 
importance=TRUE, replace=TRUE)
print(rf)
print(rf$importance)   

Output:

 Call:
 randomForest(formula = admit ~ gre + gpa + rank, data = mydata,      
 ntree = 1000, importance = TRUE, replace = TRUE) 
           Type of random forest: classification
                 Number of trees: 1000
 No. of variables tried at each split: 1

    OOB estimate of  error rate: 28.5%
Confusion matrix:
   0  1 class.error
0 254 19  0.06959707
1  95 32  0.74803150
          0          1 MeanDecreaseAccuracy MeanDecreaseGini
gre  0.01741400 0.01209596           0.01566284         31.45256
gpa  0.02565179 0.02467486           0.02527394         43.32355
rank 0.02570388 0.04844323           0.03283692         18.15780

Python code:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

mydata = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
train_data = mydata[["gre", "gpa", "rank"]]
train_label = mydata.admit

rfs = RandomForestClassifier(n_estimators=1000, oob_score=True, bootstrap=True)
rfs.fit(train_data, train_label)
print(rfs.oob_score_)

pred=np.round(rfs.oob_decision_function_[:,1])
real=train_label
confusion_matrix(real, pred)
rfs.feature_importances_

Output:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
        oob_score=True, random_state=None, verbose=0, warm_start=False)
0.6925
array([[229,  44],
       [ 79,  48]])
array([ 0.34573918,  0.53783772,  0.11642309])
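
Note that the two importance outputs above are not on the same scale: R's rf$importance reports the permutation-based MeanDecreaseAccuracy alongside MeanDecreaseGini, whereas sklearn's feature_importances_ is purely impurity (Gini) based. Below is a rough sketch of computing a permutation-style importance in Python, assuming a scikit-learn version that provides sklearn.inspection.permutation_importance; it is measured on the training data rather than out-of-bag samples, so it is only a loose analogue of MeanDecreaseAccuracy.

# Sketch: permutation importance in scikit-learn, closer in spirit to
# randomForest's MeanDecreaseAccuracy than feature_importances_.
# Assumes scikit-learn >= 0.22 (sklearn.inspection.permutation_importance).
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import pandas as pd

mydata = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
X = mydata[["gre", "gpa", "rank"]]
y = mydata.admit

rfs = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0)
rfs.fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy
# (on the training data, not OOB, so values will be optimistic).
result = permutation_importance(rfs, X, y, n_repeats=10, random_state=0)
for name, mean in zip(X.columns, result.importances_mean):
    print(name, mean)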

I've found a similar question, Difference between random forest implementations, which links to different benchmarks...

Upvotes: 1

Views: 1917

Answers (1)

Paul Rubenstein

Reputation: 342

Training a random forest (as with many machine learning models) depends heavily on the model's hyper-parameter values, as well as on the random seed, which affects the random choices made during training.

I would assume that in your case the default hyper-parameters chosen by the Python and R libraries are different, leading to different model behaviours. You could test whether there really is a difference by manually setting all of the hyper-parameters to equal values. I would guess that any remaining difference is probably due to numerical issues, or simply to chance (two random forests trained on the same data will never be identical, because of the randomness in training).
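
For instance, here is a minimal sketch of pinning the scikit-learn side to values that mirror randomForest's classification defaults. The parameter mapping is an assumption on my part rather than an exact equivalence, and random_state only makes repeated Python runs comparable to each other; matching R would also require set.seed, and even then the two implementations draw random numbers differently.

# Sketch: set sklearn parameters to mirror randomForest's classification
# defaults (assumed mapping, not an exact equivalence).
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

mydata = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
X = mydata[["gre", "gpa", "rank"]]
y = mydata.admit

rfs = RandomForestClassifier(
    n_estimators=1000,   # ntree = 1000
    max_features=1,      # mtry = 1, as reported in the R output
    min_samples_leaf=1,  # nodesize = 1, randomForest's classification default
    bootstrap=True,      # replace = TRUE
    oob_score=True,
    random_state=0,      # fixes sklearn's randomness only; R needs set.seed() separately
)
rfs.fit(X, y)
print(rfs.oob_score_)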

Upvotes: 1
