Reputation: 23
I'm working with a dataset that has 170 features and 26000 observations. When I fit a DecisionTreeClassifier model to this dataset without passing any restrictions, it produces a tree with all 170 features and 8173 nodes. However, when I try to restrict the tree's features using max_leaf_nodes and max_features (as shown below) and print out the features of the resulting tree, they don't respect the parameters I pass to the classifier. Why would this be happening? I haven't taken care to purge the dataset of collinear variables at this point, so I imagine that could impact my classification, but I'm still surprised that the function seems to ignore the parameters I give it (but it doesn't seem to ignore them entirely, since it doesn't produce the same tree as it did without any restrictions at all).
tuned_tree = DecisionTreeClassifier(max_leaf_nodes=1000, max_features=40)
tuned_tree.fit(X_train, y_train)
print("Number of features: {}".format(tuned_tree.tree_.n_features))
print("Number of nodes (leaves): {}".format(tuned_tree.tree_.node_count),"\n")
Output:
Number of features: 170
Number of nodes (leaves): 1999
Upvotes: 0
Views: 363
Reputation: 116
To be honest, the sklearn documentation can be confusing at times; still, I will refer you to it and point out some details. From the docs:
max_features: The number of features to consider when looking for the best split
Now, you might assume that max_features would be the number of features used in the tree; however, this is not the case. It is the number of features to consider at each split: at every node, a random subset of that size is drawn, and the best split is chosen from within it. Over many splits, the tree can therefore end up using far more than max_features distinct features. Note also that tree_.n_features simply reports the number of features in the training data, not the number actually used in splits. See also the thread How max_features parameter works in DecisionTreeClassifier?
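To see this in action, here is a small sketch on synthetic data (the dataset and parameter values are made up for illustration): even with max_features=5, the fitted tree typically splits on many more than 5 distinct features, while tree_.n_features just echoes the input width.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(2000, 20)                       # 20 input features
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # label depends on only 2 of them

tree = DecisionTreeClassifier(max_features=5, random_state=0)
tree.fit(X, y)

# tree_.feature holds the feature index used at each node (-2 for leaves),
# so the distinct non-negative entries are the features actually split on
used = {f for f in tree.tree_.feature if f >= 0}

print(tree.tree_.n_features)   # 20 -- just the input width, regardless of max_features
print(len(used))               # usually well above 5: a fresh subset is drawn per split
```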
max_leaf_nodes: Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.
Here I think it's more a question of terminology. Not all nodes are leaves, but all leaves are nodes (the end nodes in a tree, to be specific; also called leaf nodes). node_count counts internal nodes and leaves together, which is why it exceeds max_leaf_nodes. To get just the leaf nodes you can use:
tuned_tree.tree_.n_leaves
Upvotes: 1