user1559897

Reputation: 1484

Decision tree in sklearn: how to build a fully grown tree with one data point at each terminal node?

How should I build a fully grown decision tree with one data point at each terminal node? I am looking for a tree model that gives an in-sample error rate equal to 0%.

from sklearn import tree
clf = tree.DecisionTreeClassifier(random_state=0, min_samples_split=2, max_depth=100000000)
clf = clf.fit(feature, tgt)

pred = clf.predict(feature) * tgt 
len(pred[pred > 0]) / len(pred)

I expect this code to return 1.0, but I get 57% instead.

Upvotes: 1

Views: 205

Answers (1)

Sanjar Adilov

Reputation: 1099

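By default, max_depth=None and min_samples_split=2, so the tree keeps splitting until every terminal node is pure or contains fewer than min_samples_split samples (that is, a single sample). You therefore don't have to guess the maximum depth of a fully grown tree; just leave max_depth unset. Note that a leaf ends up with exactly one sample only where that is needed: a pure node holding several samples of the same class is not split further.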
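A minimal sketch of this behaviour, assuming synthetic data (the array names X and y, the sample count, and the random labels are illustrative and not taken from the question). With continuous features every row is distinct, so a tree built with the default settings should be able to separate all samples even when the labels are pure noise:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed for illustration only (not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # continuous features, so all rows differ
y = rng.integers(0, 2, size=200)     # random 0/1 labels, i.e. pure noise

# Defaults: max_depth=None, min_samples_split=2, min_samples_leaf=1.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

train_accuracy = clf.score(X, y)

# Leaves are the nodes without children (children_left == -1).
is_leaf = clf.tree_.children_left == -1
leaf_impurity = clf.tree_.impurity[is_leaf]

print("training accuracy:", train_accuracy)
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
print("all leaves pure:", bool(np.allclose(leaf_impurity, 0.0)))
```

If the data contained identical feature rows with different labels, no split could separate them, and the training accuracy would stay below 1.0 regardless of the depth.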

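As for the accuracy, you do not get 1.0 because len(pred[pred > 0]) does not count correct predictions. Since pred is the product of the prediction and the target, it counts the samples for which that product is positive. If your labels are 0 and 1 (an assumption, as the question does not show them), the product is positive only when both the prediction and the target are 1, so you are effectively dividing the number of positive samples by the total number of samples len(pred). Compare the predictions with the targets directly instead: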

>>> import numpy as np
>>> np.mean(clf.predict(feature) == tgt)
1.0
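To see why the product-based measure falls short, here is a small sketch with hypothetical labels (the arrays below are made up for illustration, and the 0/1 encoding is an assumption about the question's data). Every prediction is correct, yet the product test only counts the samples labelled 1:

```python
import numpy as np

# Hypothetical 0/1 targets with perfect predictions (illustration only).
tgt01 = np.array([0, 1, 1, 0, 1, 0, 0])
pred01 = tgt01.copy()

# The question's measure: fraction of samples where prediction * target > 0.
# With 0/1 labels this is the fraction of samples where both equal 1.
frac_product_positive = np.mean(pred01 * tgt01 > 0)

# Direct comparison: fraction of samples where prediction equals target.
accuracy = np.mean(pred01 == tgt01)

# With -1/+1 labels the product is positive exactly when the two agree,
# so the product-based measure would coincide with the accuracy.
tgt_pm = np.where(tgt01 == 1, 1, -1)
pred_pm = tgt_pm.copy()
frac_product_positive_pm = np.mean(pred_pm * tgt_pm > 0)

print(frac_product_positive, accuracy, frac_product_positive_pm)
```

Here the product-based value is 3/7, the share of positive samples, while the direct comparison gives 1.0.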

It is more convenient, though, to use the score method provided by scikit-learn classifiers:

>>> clf.fit(feature, tgt)
>>> clf.score(feature, tgt)
1.0

It returns the mean accuracy on the given features and targets, which is exactly what you are looking for.

Upvotes: 1
