Reputation: 451
I am working with the load_iris data set from sklearn in Python and in R (it's just called iris in R).
I built the model in both languages using the "gini" criterion, and in both languages the model tests correctly when the test data is taken directly from the iris data set.
However, if I give a new data set as test input, Python and R put it into different categories.
I'm not sure what I am missing here or doing wrong, so any guidance would be very much appreciated.
The code is below. Python 2.7:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
model = tree.DecisionTreeClassifier(criterion='gini')
model.fit(iris.data, iris.target)
model.score(iris.data, iris.target)
print iris.data[49],model.predict([iris.data[49]])
print iris.data[99],model.predict([iris.data[99]])
print iris.data[100],model.predict([iris.data[100]])
print iris.data[149],model.predict([iris.data[149]])
print [6.3,2.8,6,1.3],model.predict([[6.3,2.8,6,1.3]])
R-Rstudio running 3.3.2 32 bit:
library(rpart)
iris<- iris
x_train = iris[c('Sepal.Length','Sepal.Width','Petal.Length','Petal.Width')]
y_train = as.matrix(cbind(iris['Species']))
x <- cbind(x_train,y_train)
fit <- rpart(y_train ~ ., data = x_train,method="class",parms = list(split = "gini"))
summary(fit)
x_test = x[149,]
x_test[,1]=6.3
x_test[,2]=2.8
x_test[,3]=6
x_test[,4]=1.3
predicted1= predict(fit,x[49,]) # same as python result
predicted2= predict(fit,x[100,]) # same as python result
predicted3= predict(fit,x[101,]) # same as python result
predicted4= predict(fit,x[149,]) # same as python result
predicted5= predict(fit,x_test) ## this value does not match with pythons result
My python output is :
[ 5. 3.3 1.4 0.2] [0]
[ 5.7 2.8 4.1 1.3] [1]
[ 6.3 3.3 6. 2.5] [2]
[ 5.9 3. 5.1 1.8] [2]
[6.3, 2.8, 6, 1.3] [2] -----> this means it's putting the test data into the virginica bucket
and R output is:
> predicted1
setosa versicolor virginica
49 1 0 0
> predicted2
setosa versicolor virginica
100 0 0.9074074 0.09259259
> predicted3
setosa versicolor virginica
101 0 0.02173913 0.9782609
> predicted4
setosa versicolor virginica
149 0 0.02173913 0.9782609
> predicted5
setosa versicolor virginica
149 0 0.9074074 0.09259259 --> this means it's putting the test data into the versicolor bucket
Please help. Thank you.
Upvotes: 3
Views: 1544
Reputation: 11514
Decision trees involve quite a few parameters (min/max leaf size, depth of the tree, when to split, etc.), and different packages may have different default settings. If you want to get the same results, you need to make sure the implicit defaults are similar. For instance, try running the following:
fit <- rpart(y_train ~ ., data = x_train,method="class",
parms = list(split = "gini"),
control = rpart.control(minsplit = 2, minbucket = 1, xval=0, maxdepth = 30))
(predicted5= predict(fit,x_test))
setosa versicolor virginica
149 0 0.3333333 0.6666667
Here, the options minsplit = 2, minbucket = 1, xval = 0 and maxdepth = 30 are chosen so as to be identical to the sklearn options, see here. maxdepth = 30 is the largest value rpart will let you have; sklearn has no bound here. If you want the probabilities etc. to be identical as well, you will probably want to play around with the cp parameter too.
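The sklearn side of that comparison can be read off directly from the estimator itself; a quick sketch (Python 3 syntax) that prints the implicit defaults the rpart.control call has to match:

```python
from sklearn.tree import DecisionTreeClassifier

# The defaults rpart.control is being aligned with: min_samples_split = 2,
# min_samples_leaf = 1, and no depth limit (max_depth = None).
params = DecisionTreeClassifier(criterion='gini').get_params()
for name in ('min_samples_split', 'min_samples_leaf', 'max_depth'):
    print(name, '=', params[name])
```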
Similarly, with
model = tree.DecisionTreeClassifier(criterion='gini',
min_samples_split=20,
min_samples_leaf=round(20.0/3.0), max_depth=30)
model.fit(iris.data, iris.target)
I get
print model.predict([iris.data[49]])
print model.predict([iris.data[99]])
print model.predict([iris.data[100]])
print model.predict([iris.data[149]])
print model.predict([[6.3,2.8,6,1.3]])
[0]
[1]
[2]
[2]
[1]
which looks pretty similar to your initial R output.
Needless to say, be careful when your predictions (on the training set) seem "unreasonably good", as you are likely to overfit the data. For instance, have a look at model.predict_proba(...), which gives you the probabilities in sklearn (instead of the predicted classes). You should see that with your current Python code/settings, you are almost surely overfitting.
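A quick sketch of that check in sklearn (Python 3 syntax; the 5-fold split and random_state=0 are my choices, not part of the original code):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(criterion='gini', random_state=0)
model.fit(iris.data, iris.target)

# Resubstitution accuracy: the fully grown tree fits the training set perfectly.
train_score = model.score(iris.data, iris.target)
print('train score:', train_score)  # 1.0

# Cross-validated accuracy is a more honest estimate, and comes out lower.
cv_score = cross_val_score(model, iris.data, iris.target, cv=5).mean()
print('cv score:', cv_score)

# predict_proba on the disputed point: the leaf is pure, so the tree is
# 100% "certain" of its class -- typical of an overfit, unpruned tree.
print(model.predict_proba([[6.3, 2.8, 6.0, 1.3]]))
```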
Upvotes: 6
Reputation: 40878
In addition to @coffeeinjunky's answer, you will need to pay attention to the random_state parameter
(this is the Python parameter; I'm not sure what it is called in R). The generation of the tree itself is pseudo-random, so you need to make sure both models use the same seed value. Otherwise, you will fit/predict with the same model and get different results on each run, because the tree being used differs each time.
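A minimal sketch of pinning the seed in sklearn (the value 0 is arbitrary; any fixed integer will do):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Two fits with the same random_state build the same tree, so the
# predictions agree run after run.
a = DecisionTreeClassifier(criterion='gini', random_state=0).fit(iris.data, iris.target)
b = DecisionTreeClassifier(criterion='gini', random_state=0).fit(iris.data, iris.target)
print((a.predict(iris.data) == b.predict(iris.data)).all())  # True
```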
Check out the section on decision trees in Müller & Guido's 'Introduction to Machine Learning with Python.' It does a good job of visually explaining the different parameters, and pdfs are floating around the internet if you just try a Google search. With decision trees and ensemble learning methods, the parameters you specify will have a meaningful effect on the predictions.
Upvotes: 2