Reputation: 451
I am working with the load_iris data set from sklearn in Python and in R (it's just called iris in R).
I built the model in both languages using the "gini" criterion, and in both languages the model tests correctly when the test data is taken directly from the iris data set.
However, if I give a new data set as test input, Python and R put it into different categories.
I'm not sure what I am missing here or doing wrong, so any guidance would be very much appreciated.
The code is below. Python 2.7:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
model = tree.DecisionTreeClassifier(criterion='gini')
model.fit(iris.data, iris.target)
model.score(iris.data, iris.target)
print iris.data[49],model.predict([iris.data[49]])
print iris.data[99],model.predict([iris.data[99]])
print iris.data[100],model.predict([iris.data[100]])
print iris.data[149],model.predict([iris.data[149]])
print [6.3,2.8,6,1.3],model.predict([[6.3,2.8,6,1.3]])
R-Rstudio running 3.3.2 32 bit:
library(rpart)
iris<- iris
x_train = iris[c('Sepal.Length','Sepal.Width','Petal.Length','Petal.Width')]
y_train = as.matrix(cbind(iris['Species']))
x <- cbind(x_train,y_train)
fit <- rpart(y_train ~ ., data = x_train,method="class",parms = list(split = "gini"))
summary(fit)
x_test = x[149,]
x_test[,1]=6.3
x_test[,2]=2.8
x_test[,3]=6
x_test[,4]=1.3
predicted1= predict(fit,x[49,]) # same as python result
predicted2= predict(fit,x[100,]) # same as python result
predicted3= predict(fit,x[101,]) # same as python result
predicted4= predict(fit,x[149,]) # same as python result
predicted5= predict(fit,x_test) ## this value does not match with pythons result
My python output is :
[ 5. 3.3 1.4 0.2] [0]
[ 5.7 2.8 4.1 1.3] [1]
[ 6.3 3.3 6. 2.5] [2]
[ 5.9 3. 5.1 1.8] [2]
[6.3, 2.8, 6, 1.3] [2] -----> this means it's putting the test data into the virginica bucket
and R output is:
> predicted1
setosa versicolor virginica
49 1 0 0
> predicted2
setosa versicolor virginica
100 0 0.9074074 0.09259259
> predicted3
setosa versicolor virginica
101 0 0.02173913 0.9782609
> predicted4
setosa versicolor virginica
149 0 0.02173913 0.9782609
> predicted5
setosa versicolor virginica
149 0 0.9074074 0.09259259 --> this means it's putting the test data into the versicolor bucket
Please help. Thank you.
Upvotes: 3
Views: 1544
Reputation: 11514
Decision trees involve quite a few parameters (min/max leaf size, depth of the tree, when to split, etc.), and different packages may have different default settings. If you want to get the same results, you need to make sure the implicit defaults are similar. For instance, try running the following:
fit <- rpart(y_train ~ ., data = x_train,method="class",
parms = list(split = "gini"),
control = rpart.control(minsplit = 2, minbucket = 1, xval=0, maxdepth = 30))
(predicted5= predict(fit,x_test))
setosa versicolor virginica
149 0 0.3333333 0.6666667
Here, the options minsplit = 2, minbucket = 1, xval = 0 and maxdepth = 30 are chosen so as to be identical to the sklearn options, see here. maxdepth = 30 is the largest value rpart will let you have; sklearn has no bound here. If you want the probabilities etc. to be identical as well, you will probably want to play around with the cp parameter too.
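The sklearn side of that comparison can be read off directly from the estimator itself; a quick sketch (Python 3 syntax) that prints the implicit defaults the rpart.control call has to match:

```python
from sklearn.tree import DecisionTreeClassifier

# The defaults rpart.control is being aligned with: min_samples_split = 2,
# min_samples_leaf = 1, and no depth limit (max_depth = None).
params = DecisionTreeClassifier(criterion='gini').get_params()
for name in ('min_samples_split', 'min_samples_leaf', 'max_depth'):
    print(name, '=', params[name])
```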
Similarly, with
model = tree.DecisionTreeClassifier(criterion='gini',
min_samples_split=20,
min_samples_leaf=round(20.0/3.0), max_depth=30)
model.fit(iris.data, iris.target)
I get
print model.predict([iris.data[49]])
print model.predict([iris.data[99]])
print model.predict([iris.data[100]])
print model.predict([iris.data[149]])
print model.predict([[6.3,2.8,6,1.3]])
[0]
[1]
[2]
[2]
[1]
which looks pretty similar to your initial R output.
Needless to say, be careful when your predictions (on the training set) seem "unreasonably good", as you are likely to overfit the data. For instance, have a look at model.predict_proba(...), which gives you the probabilities in sklearn (instead of the predicted classes). You should see that with your current Python code/settings, you are almost surely overfitting.
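A quick sketch of that check in sklearn (Python 3 syntax; the 5-fold split and random_state=0 are my choices, not part of the original code):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(criterion='gini', random_state=0)
model.fit(iris.data, iris.target)

# Resubstitution accuracy: the fully grown tree fits the training set perfectly.
train_score = model.score(iris.data, iris.target)
print('train score:', train_score)  # 1.0

# Cross-validated accuracy is a more honest estimate, and comes out lower.
cv_score = cross_val_score(model, iris.data, iris.target, cv=5).mean()
print('cv score:', cv_score)

# predict_proba on the disputed point: the leaf is pure, so the tree is
# 100% "certain" of its class -- typical of an overfit, unpruned tree.
print(model.predict_proba([[6.3, 2.8, 6.0, 1.3]]))
```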
Upvotes: 6
Reputation: 40878
In addition to @coffeeinjunky's answer, you will need to pay attention to the random_state parameter
(this is the Python parameter; I'm not sure what it is called in R). The generation of the tree itself is pseudo-random, so you need to make sure both models use the same seed value. Otherwise, you will fit/predict with the same model and get different results on each run, because the tree being used differs each time.
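A minimal sketch of pinning the seed in sklearn (the value 0 is arbitrary; any fixed integer will do):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Two fits with the same random_state build the same tree, so the
# predictions agree run after run.
a = DecisionTreeClassifier(criterion='gini', random_state=0).fit(iris.data, iris.target)
b = DecisionTreeClassifier(criterion='gini', random_state=0).fit(iris.data, iris.target)
print((a.predict(iris.data) == b.predict(iris.data)).all())  # True
```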
Check out the section on decision trees in Müller & Guido's 'Introduction to Machine Learning with Python.' It does a good job of visually explaining the different parameters, and pdfs are floating around the internet if you just try a Google search. With decision trees and ensemble learning methods, the parameters you specify will have a meaningful effect on the predictions.
Upvotes: 2