Reputation: 91
The code below trains a Random Forest model on the same dataset in R and in Python. As you can see, the accuracy is better in R (1 - 0.285 = 0.715, from the OOB error rate) than in Python (0.69). Furthermore, the feature importances differ between R and Python.
[EDIT] Is there any way to replicate the R results in Python, or are there things that are simply out of our control? Some of the tunable parameters differ between the two libraries, which makes it hard to match them exactly.
Does anybody else get different results from Python's and R's random forests? What are the differences?
library(randomForest)
mydata=read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$admit=factor(mydata$admit)
rf = randomForest(admit~gre+gpa+rank, mydata, ntree=1000,
importance=TRUE, replace=TRUE)
print(rf)
print(rf$importance)
Output:
Call:
randomForest(formula = admit ~ gre + gpa + rank, data = mydata,
ntree = 1000, importance = TRUE, replace = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 1
OOB estimate of error rate: 28.5%
Confusion matrix:
0 1 class.error
0 254 19 0.06959707
1 95 32 0.74803150
0 1 MeanDecreaseAccuracy MeanDecreaseGini
gre 0.01741400 0.01209596 0.01566284 31.45256
gpa 0.02565179 0.02467486 0.02527394 43.32355
rank 0.02570388 0.04844323 0.03283692 18.15780
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
mydata=pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
train_data = mydata[["gre","gpa","rank"]]
train_label = mydata.admit
rfs = RandomForestClassifier(n_estimators=1000,oob_score=True,bootstrap=True)
rfs.fit(train_data,train_label)
print(rfs.oob_score_)
pred=np.round(rfs.oob_decision_function_[:,1])
real=train_label
confusion_matrix(real, pred)
rfs.feature_importances_
Output:
RandomForestClassifier(bootstrap=True,
class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
oob_score=True, random_state=None, verbose=0,
warm_start=False)
0.6925
array([[229, 44],
       [ 79, 48]])
array([ 0.34573918, 0.53783772, 0.11642309])
I've found a similar question, Difference between random forest implementation, which links to different benchmarks...
Upvotes: 1
Views: 1917
Reputation: 342
Training a Random Forest model (as with many machine learning models) depends heavily on the model's hyper-parameter values, as well as on the random seed, which affects the random choices made during training.
I would assume that in your case the default hyper-parameters chosen by the Python and R libraries are different, leading to different model behaviour. You could test whether there really is a systematic difference by manually setting all of the hyper-parameters to equal values. Any remaining difference after that is probably due to numerical issues or to chance: two random forests trained on the same data will never be identical, because of the randomness inherent in the training procedure.
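As a concrete starting point, here is a minimal sketch of pinning the scikit-learn parameters to the R defaults visible in the question's R output. It runs on synthetic stand-in data (an assumption, just to keep the example self-contained); the parameter mapping in the comments is my reading of the two libraries' documented defaults, and exact agreement is still not guaranteed because the RNGs and tree-building internals differ between the implementations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 3-feature admissions data (assumption for
# illustration only; substitute your own gre/gpa/rank matrix and admit labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rfs = RandomForestClassifier(
    n_estimators=1000,   # R: ntree = 1000
    max_features=1,      # R output shows mtry = 1 variable tried at each split
    min_samples_leaf=1,  # R: nodesize = 1 for classification (default)
    bootstrap=True,      # R: replace = TRUE (bootstrap sampling)
    criterion="gini",    # both libraries split on Gini impurity by default
    oob_score=True,
    random_state=42,     # fixes Python's RNG only; R is seeded separately via set.seed()
)
rfs.fit(X, y)
print(rfs.oob_score_)
```

Even with these settings aligned, expect the OOB accuracy and importances to fluctuate from run to run (and between languages), since the two libraries draw their bootstrap samples and candidate splits from different random number generators.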
Upvotes: 1