Zelphir Kaltstahl

Reputation: 6189

sklearn DecisionTreeClassifier more depth less accuracy?

I have two trained sklearn.tree.tree.DecisionTreeClassifier instances. Both are trained on the same training data, but with different maximum tree depths: max_depth was 6 for the decision_tree_model and 2 for the small_model. Besides max_depth, no other parameters were specified.
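For reference, the setup looks roughly like this (the random sparse matrix and labels below are just stand-ins for my actual data, only so the snippet runs on its own):

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the real features and labels come from my assignment dataset.
training_data_sparse_matrix = sparse_random(1000, 50, density=0.1, format='csr', random_state=0)
training_data_labels = np.random.RandomState(0).randint(0, 2, size=1000)

# Only max_depth differs between the two models.
small_model = DecisionTreeClassifier(max_depth=2)
small_model.fit(training_data_sparse_matrix, training_data_labels)

decision_tree_model = DecisionTreeClassifier(max_depth=6)
decision_tree_model.fit(training_data_sparse_matrix, training_data_labels)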

When I compute the accuracy of both of them on the training data like this:

small_model_accuracy = small_model.score(training_data_sparse_matrix, training_data_labels)
decision_tree_model_accuracy = decision_tree_model.score(training_data_sparse_matrix, training_data_labels)

Surprisingly, the output is:

small_model accuracy: 0.61170212766
decision_tree_model accuracy: 0.422496238986

How is this even possible? Shouldn't a tree with a higher maximum depth always have a higher accuracy on the training data when learned with the same training data? Or is it maybe the score function that outputs 1 - accuracy or something?
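As far as I can tell from the scikit-learn docs, though, score returns the plain mean accuracy on the given data, not 1 - accuracy, so that part should not be the issue. Continuing from the snippet above, it matches accuracy_score:

from sklearn.metrics import accuracy_score

# score() is the mean accuracy on the given data, i.e. the same value
# accuracy_score computes for the model's own predictions.
predictions = decision_tree_model.predict(training_data_sparse_matrix)
manual_accuracy = accuracy_score(training_data_labels, predictions)
print(manual_accuracy == decision_tree_model.score(training_data_sparse_matrix, training_data_labels))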

EDIT:

EDIT#2:

It seems to be a mistake I made when working with the training data. I thought about the whole thing again and concluded: "Well, if the depth is higher, the tree shouldn't be the reason for this. What else is there? The training data itself. But I used the same data! Maybe I did something to the training data in between?" Then I checked again, and there is indeed a difference in how I use the training data: I need to transform it from an SFrame into a scipy matrix (which might also have to be sparse). Now I made another accuracy calculation right after fitting the two models, and this one results in 61% accuracy for the small_model and 64% accuracy for the decision_tree_model. That's only 3% more and still somewhat surprising, but at least it's possible.
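What I should have done is convert the SFrame exactly once and reuse the same matrix and label array for both fitting and scoring. A rough sketch of that, assuming an SFrame with numeric feature columns and a label column (training_sframe, feature_columns and 'label' are placeholders for my actual names, and to_dataframe() is the SFrame-to-pandas conversion):

import numpy as np
import scipy.sparse

# Convert the SFrame to a matrix once and reuse the exact same objects for
# fit() and score(), so both are guaranteed to see identical data.
features = training_sframe[feature_columns].to_dataframe().values
training_data_sparse_matrix = scipy.sparse.csr_matrix(features)
training_data_labels = np.array(training_sframe['label'])

decision_tree_model.fit(training_data_sparse_matrix, training_data_labels)
print(decision_tree_model.score(training_data_sparse_matrix, training_data_labels))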

EDIT#3:

The problem is resolved. I handled the training data in the wrong way, and that resulted in the models being fit on different data than they were scored on.

Here is the plot of accuracy after fixing the mistakes:

[Plot: decision tree accuracy vs. maximum depth]

This looks correct and would also explain why the assignment creators chose 6 as the maximum depth.

Upvotes: 3

Views: 2998

Answers (1)

Anthony E

Reputation: 11235

Shouldn't a tree with a higher maximum depth always have a higher accuracy when learned with the same training data?

No, definitely not always. The problem is that you're overfitting your model to your training data by fitting a more complex tree. Hence the lower score as you increase the maximum depth.
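One quick way to see the effect is to compare accuracy on the training set and on held-out data as max_depth grows. A small sketch with synthetic data (just for illustration, not the asker's dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only to illustrate how the gap between training and
# validation accuracy tends to widen as the tree gets deeper.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

for depth in (2, 6, 10, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_valid, y_valid))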

Upvotes: 1
