Matias Eiletz

Reputation: 429

How to force decision tree to split into different classes

I'm building a decision tree, and I would like to force the algorithm to split into different classes after a node. The problem is that in the trees I get, after evaluating the condition (is X less than a certain value?), both children end up with the same class ("yes" and "yes", for example). I want to have "yes" and "no" as the results of evaluating the node. Here is an example of what I'm getting:

[image: the resulting decision tree plot]

This is the code generating the tree and the plot:

from sklearn import tree
import graphviz

clf = tree.DecisionTreeClassifier(max_depth=2)
clf = clf.fit(users_data, users_target)

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=feature_names,
                                class_names=target_names,
                                filled=True, rounded=True,
                                special_characters=True)

graph = graphviz.Source(dot_data)
graph

I expect to find "YES" and "NO" classes after the nodes. Right now, I'm getting the same class in the last levels after the respective conditions.

Thanks!

Upvotes: 3

Views: 3797

Answers (3)

KishKash

Reputation: 171

The splits of a decision tree are greedy: a split is made as long as it decreases the chosen impurity criterion. This, as you noticed, does not guarantee that a particular split results in different classes being the majority on each side. Limiting the depth of the tree is part of the reason you see a split that is not "played out" until it reaches nodes of distinct classes.

Pruning the tree should help. I was able to avoid a similar problem using a suitable value of the ccp_alpha parameter of DecisionTreeClassifier. Here are my before- and after-pruning trees.

Before pruning

After pruning
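For what it's worth, here is a rough sketch of how that pruning looks with the question's variables (ccp_alpha requires scikit-learn 0.22+; the value 0.01 below is just an illustrative guess, in practice you would pick it from cost_complexity_pruning_path or by cross-validation):

from sklearn import tree
import graphviz

# Inspect the candidate pruning strengths for this data
clf = tree.DecisionTreeClassifier(max_depth=2)
path = clf.cost_complexity_pruning_path(users_data, users_target)
print(path.ccp_alphas)  # increasing alphas; larger values prune more

# Refit with a chosen alpha and plot the pruned tree
pruned = tree.DecisionTreeClassifier(max_depth=2, ccp_alpha=0.01)
pruned.fit(users_data, users_target)

dot_data = tree.export_graphviz(pruned, out_file=None,
                                feature_names=feature_names,
                                class_names=target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graphviz.Source(dot_data)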

Upvotes: 0

Yonatan Simson

Reputation: 2575

Try using criterion="entropy". I find this solves the problem.
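That is a one-parameter change to the question's code (a sketch only; whether it actually produces distinct classes in the leaves depends on your data):

clf = tree.DecisionTreeClassifier(max_depth=2, criterion="entropy")
clf = clf.fit(users_data, users_target)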

Upvotes: 0

desertnaut

Reputation: 60321

As is, your model indeed does look like it doesn't offer any further discrimination between the first- and second-level nodes; so, if you are certain that this is (kind of) optimal for your case, you can simply ask it to stop there by using max_depth=1 instead of 2:

clf = tree.DecisionTreeClassifier(max_depth=1)

Keep in mind however that in reality this can be far from optimal; have a look at the tree for the iris dataset from the scikit-learn docs:

[image: decision tree for the iris dataset from the scikit-learn docs]

where you can see that, further down the tree levels, nodes with class=versicolor emerge from what look like "pure" nodes of class=virginica (and vice versa).

So, before deciding to pre-prune the tree to max_depth=1, you might want to check if leaving it to grow further (i.e. by not specifying the max_depth argument, thus leaving it at its default value of None) might be better for your case.

Everything depends on why exactly you are doing this (i.e. your business case): if it is an exploratory one, you might very well stop with max_depth=1; if it is a predictive one, you should consider which configuration maximizes an appropriate metric (most probably here, the accuracy).
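As a minimal sketch of that comparison (cross_val_score is from scikit-learn; cv=5 and the candidate depths are arbitrary choices here, and users_data/users_target are the question's variables):

from sklearn import tree
from sklearn.model_selection import cross_val_score

# Compare a few depth settings by mean cross-validated accuracy
for depth in (1, 2, None):
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    scores = cross_val_score(clf, users_data, users_target, cv=5, scoring="accuracy")
    print(depth, scores.mean())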

Upvotes: 3
