R. Psmith

Reputation: 91

Random Forest Libraries: Different Results in R and Python

The code below trains a random forest model in R and in Python. As you can see, the accuracy is better in R (1 - 0.27 = 0.73) than in Python (0.69). Furthermore, the feature importances differ between R and Python.

[EDIT] Is there any way to replicate the R results in Python, or are there things that are simply out of our control? Some of the tunable parameters differ between the two libraries, which makes it hard to match them exactly.

Does anybody else get different results from Python's and R's random forests? What are the differences?

R code:

library(randomForest)
mydata=read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$admit=factor(mydata$admit)
rf = randomForest(admit~gre+gpa+rank, mydata, ntree=1000, 
importance=TRUE, replace=TRUE)
print(rf)
print(rf$importance)   

Output:

 Call:
 randomForest(formula = admit ~ gre + gpa + rank, data = mydata,      
 ntree = 1000, importance = TRUE, replace = TRUE) 
           Type of random forest: classification
                 Number of trees: 1000
 No. of variables tried at each split: 1

    OOB estimate of  error rate: 28.5%
Confusion matrix:
   0  1 class.error
0 254 19  0.06959707
1  95 32  0.74803150
          0          1 MeanDecreaseAccuracy MeanDecreaseGini
gre  0.01741400 0.01209596           0.01566284         31.45256
gpa  0.02565179 0.02467486           0.02527394         43.32355
rank 0.02570388 0.04844323           0.03283692         18.15780

Python code:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

mydata = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
train_data = mydata[["gre", "gpa", "rank"]]
train_label = mydata.admit

rfs = RandomForestClassifier(n_estimators=1000, oob_score=True, bootstrap=True)
rfs.fit(train_data, train_label)
print(rfs.oob_score_)

pred=np.round(rfs.oob_decision_function_[:,1])
real=train_label
confusion_matrix(real, pred)
rfs.feature_importances_

Output:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_impurity_decrease=0.0, min_impurity_split=None,
        min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
        oob_score=True, random_state=None, verbose=0, warm_start=False)
0.6925
array([[229,  44],
       [ 79,  48]])
array([ 0.34573918,  0.53783772,  0.11642309])
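
Note that the two importance outputs above are not on the same scale: R's rf$importance reports the permutation-based MeanDecreaseAccuracy alongside MeanDecreaseGini, whereas sklearn's feature_importances_ is purely impurity (Gini) based. Below is a rough sketch of computing a permutation-style importance in Python, assuming a scikit-learn version that provides sklearn.inspection.permutation_importance; it is measured on the training data rather than out-of-bag samples, so it is only a loose analogue of MeanDecreaseAccuracy.

# Sketch: permutation importance in scikit-learn, closer in spirit to
# randomForest's MeanDecreaseAccuracy than feature_importances_.
# Assumes scikit-learn >= 0.22 (sklearn.inspection.permutation_importance).
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
import pandas as pd

mydata = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
X = mydata[["gre", "gpa", "rank"]]
y = mydata.admit

rfs = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0)
rfs.fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy
# (on the training data, not OOB, so values will be optimistic).
result = permutation_importance(rfs, X, y, n_repeats=10, random_state=0)
for name, mean in zip(X.columns, result.importances_mean):
    print(name, mean)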

I've found a similar question, Difference between random forest implementations, which links to different benchmarks...

Upvotes: 1

Views: 1917

Answers (1)

Paul Rubenstein

Reputation: 342

Training a random forest (as with many machine learning models) depends heavily on the model's hyper-parameter values, as well as on the random seed, which affects the random choices made during training.

I would assume that in your case the default hyper-parameters chosen by the Python and R libraries are different, leading to different model behaviours. You could test whether there really is a difference by manually setting all of the hyper-parameters to equal values. I would guess that any remaining difference is probably due to numerical issues, or simply to chance (two random forests trained on the same data will never be identical, because of the randomness in training).
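
For instance, here is a minimal sketch of pinning the scikit-learn side to values that mirror randomForest's classification defaults. The parameter mapping is an assumption on my part rather than an exact equivalence, and random_state only makes repeated Python runs comparable to each other; matching R would also require set.seed, and even then the two implementations draw random numbers differently.

# Sketch: set sklearn parameters to mirror randomForest's classification
# defaults (assumed mapping, not an exact equivalence).
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

mydata = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
X = mydata[["gre", "gpa", "rank"]]
y = mydata.admit

rfs = RandomForestClassifier(
    n_estimators=1000,   # ntree = 1000
    max_features=1,      # mtry = 1, as reported in the R output
    min_samples_leaf=1,  # nodesize = 1, randomForest's classification default
    bootstrap=True,      # replace = TRUE
    oob_score=True,
    random_state=0,      # fixes sklearn's randomness only; R needs set.seed() separately
)
rfs.fit(X, y)
print(rfs.oob_score_)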

Upvotes: 1
