Reputation: 429
I'm building a decision tree, and I would like to force the algorithm to split the results into different classes after a node. The problem is that in the trees I get, after evaluating the condition (is X less than a certain value), both branches end up with the same class (yes and yes, for example). I want to have "yes" and "no" as the outcomes of the evaluation of a node. Here is an example of what I'm getting:
This is the code generating the tree and the plot:
from sklearn import tree
import graphviz

# Fit a depth-limited decision tree on the training data
clf = tree.DecisionTreeClassifier(max_depth=2)
clf = clf.fit(users_data, users_target)

# Export the fitted tree to Graphviz format and render it
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=feature_names,
                                class_names=target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
I expect to find "YES" and "NO" classes after the nodes. Right now, I'm getting the same class in the last levels after the respective conditions.
Thanks!
Upvotes: 3
Views: 3797
Reputation: 171
The splits of a decision tree are greedy: a node is split as long as the split decreases the chosen impurity criterion. As you noticed, this does not guarantee that a particular split results in different majority classes on its two sides. Limiting the depth of the tree is part of the reason you see a split that is not "played out" until it can reach nodes of distinct classes.
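To see why, here is a minimal sketch with made-up toy data, where the single impurity-reducing split still leaves the same majority class in both children:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: two "no" samples buried among "yes" samples
X = np.arange(10).reshape(-1, 1)               # feature values 0..9
y = np.array([1, 1, 1, 0, 0, 1, 1, 1, 1, 1])   # 1 = yes, 0 = no

clf = DecisionTreeClassifier(max_depth=1).fit(X, y)

# The best split (X <= 4.5) lowers the weighted Gini impurity from 0.32 to 0.24,
# yet class 1 ("yes") remains the majority in both resulting leaves.
print(clf.tree_.threshold[0])   # split threshold of the root
print(clf.tree_.value)          # class counts in root and leaves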
Pruning the tree should help. I was able to avoid a similar problem using a suitable value of the ccp_alpha parameter of the DecisionTreeClassifier. Here are my before- and after-trees.
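In case it helps, here is a rough sketch of how you could pick the value, reusing the variable names from your question; the exact alpha depends on your data, so treat the index I use as a placeholder:

from sklearn.tree import DecisionTreeClassifier

# Candidate alphas come from the cost-complexity pruning path of an unpruned tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(users_data, users_target)

# Refit with a chosen alpha; larger values prune more aggressively
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
clf = clf.fit(users_data, users_target)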
Upvotes: 0
Reputation: 2575
Try using criterion="entropy". I find this solves the problem.
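For example (a sketch, reusing the variable names from the question):

from sklearn import tree

# Use information gain (entropy) instead of the default Gini criterion
clf = tree.DecisionTreeClassifier(max_depth=2, criterion="entropy")
clf = clf.fit(users_data, users_target)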
Upvotes: 0
Reputation: 60321
As is, your model indeed looks like it doesn't offer any further discrimination between the first- and second-level nodes; so, if you are certain that this is (kind of) optimal for your case, you can simply ask it to stop there by using max_depth=1 instead of 2:
clf = tree.DecisionTreeClassifier(max_depth=1)
Keep in mind however that in reality this can be far from optimal; have a look at the tree for the iris dataset from the scikit-learn docs:
where you can see that, further down the tree levels, nodes with class=versicolor emerge from what look like "pure" nodes of class=virginica (and vice versa).
So, before deciding to prune the tree to max_depth=1, you might want to check whether leaving it to grow further (i.e. by not specifying the max_depth argument, thus leaving it at its default value of None) might be better for your case.
Everything depends on why exactly you are doing this (i.e. your business case): if it is an exploratory one, you might very well stop with max_depth=1; if it is a predictive one, you should consider which configuration maximizes an appropriate metric (most probably here, the accuracy).
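If you go the predictive route, a quick way to compare configurations is cross-validated accuracy over a few candidate depths (a sketch, reusing the variable names from your question; the candidate values are just placeholders):

from sklearn import tree
from sklearn.model_selection import cross_val_score

# Compare a few depth settings by 5-fold cross-validated accuracy
for depth in [1, 2, 3, None]:
    clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, users_data, users_target, cv=5, scoring="accuracy")
    print(depth, scores.mean())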
Upvotes: 3