Reputation: 1484
How should I build a fully grown decision tree, with one data point at each terminal node? I am looking for a tree model whose in-sample error rate is 0%.
from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=0, min_samples_split=2, max_depth=100000000)
clf = clf.fit(feature, tgt)
pred = clf.predict(feature) * tgt
len(pred[pred > 0]) / len(pred)
I am expecting 1.0 from this code, but for some reason I get 0.57 instead.
Upvotes: 1
Views: 205
Reputation: 1099
By default, max_depth=None and min_samples_split=2, so the tree keeps expanding until every terminal node is pure, down to a single sample per node if necessary. That is, you don't have to guess the maximum depth of a fully grown tree.
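A minimal sketch of this (with synthetic data assumed for illustration): with the default parameters and no two identical rows carrying different labels, the fully grown tree fits the training set perfectly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 samples, 5 continuous features
y = rng.integers(0, 2, size=100)   # random 0/1 labels

# max_depth=None and min_samples_split=2 are the defaults,
# so the tree grows until every leaf is pure.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # 1.0 on the training data
```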
As for the error rate: you don't get 1.0 because len(pred[pred > 0]) / len(pred) counts a prediction as correct only when the product clf.predict(feature) * tgt is positive. With 0/1 labels, every sample whose target is 0 yields a product of 0 and is never counted, so the ratio underestimates the accuracy. Compare the predictions to the targets directly instead:
>>> import numpy as np
>>> np.mean(clf.predict(feature) == tgt)
1.0
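To see why the product test undercounts, here is a toy example (arrays assumed for illustration): a correct prediction of class 0 gives a product of 0 and is excluded by the pred > 0 filter.

```python
import numpy as np

tgt = np.array([0, 1, 0, 1])
pred = np.array([0, 1, 0, 1])            # perfect predictions
prod = pred * tgt                        # zero wherever tgt is 0

print(len(prod[prod > 0]) / len(prod))   # 0.5 -- the two correct 0s are missed
print(np.mean(pred == tgt))              # 1.0 -- the actual accuracy
```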
But it is more convenient to use the score method provided by scikit-learn classifiers:
>>> clf.fit(feature, tgt)
>>> clf.score(feature, tgt)
1.0
It returns the mean accuracy on the given features and targets, exactly what you are looking for.
Upvotes: 1