Zelphir Kaltstahl

Reputation: 6189

sklearn DecisionTreeClassifier more depth less accuracy?

I have two trained sklearn.tree.tree.DecisionTreeClassifier instances. Both are trained on the same training data, but with different maximum tree depths: max_depth was 6 for the decision_tree_model and 2 for the small_model. Besides max_depth, no other parameters were specified.
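For reference, the setup looks roughly like this (the random sparse matrix and labels below are just stand-ins for my actual data, only so the snippet runs on its own):

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the real features and labels come from my assignment dataset.
training_data_sparse_matrix = sparse_random(1000, 50, density=0.1, format='csr', random_state=0)
training_data_labels = np.random.RandomState(0).randint(0, 2, size=1000)

# Only max_depth differs between the two models.
small_model = DecisionTreeClassifier(max_depth=2)
small_model.fit(training_data_sparse_matrix, training_data_labels)

decision_tree_model = DecisionTreeClassifier(max_depth=6)
decision_tree_model.fit(training_data_sparse_matrix, training_data_labels)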

When I compute the accuracy of both of them on the training data like this:

small_model_accuracy = small_model.score(training_data_sparse_matrix, training_data_labels)
decision_tree_model_accuracy = decision_tree_model.score(training_data_sparse_matrix, training_data_labels)

Surprisingly, the output is:

small_model accuracy: 0.61170212766
decision_tree_model accuracy: 0.422496238986

How is this even possible? Shouldn't a tree with a higher maximum depth always have a higher accuracy on the training data when learned with the same training data? Or is it maybe the score function that outputs 1 - accuracy or something?
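As far as I can tell from the scikit-learn docs, though, score returns the plain mean accuracy on the given data, not 1 - accuracy, so that part should not be the issue. Continuing from the snippet above, it matches accuracy_score:

from sklearn.metrics import accuracy_score

# score() is the mean accuracy on the given data, i.e. the same value
# accuracy_score computes for the model's own predictions.
predictions = decision_tree_model.predict(training_data_sparse_matrix)
manual_accuracy = accuracy_score(training_data_labels, predictions)
print(manual_accuracy == decision_tree_model.score(training_data_sparse_matrix, training_data_labels))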

EDIT:

EDIT#2:

It seems to be a mistake I made when working with the training data. I thought about the whole thing again and concluded: "Well, if the depth is higher, the tree shouldn't be the reason for this. What else is there? The training data itself. But I used the same data! Maybe I did something to the training data in between?" Then I checked again, and there is indeed a difference in how I use the training data: I need to transform it from an SFrame into a scipy matrix (which might also have to be sparse). Now I made another accuracy calculation right after fitting the two models, and this one results in 61% accuracy for the small_model and 64% accuracy for the decision_tree_model. That's only 3% more and still somewhat surprising, but at least it's possible.
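What I should have done is convert the SFrame exactly once and reuse the same matrix and label array for both fitting and scoring. A rough sketch of that, assuming an SFrame with numeric feature columns and a label column (training_sframe, feature_columns and 'label' are placeholders for my actual names, and to_dataframe() is the SFrame-to-pandas conversion):

import numpy as np
import scipy.sparse

# Convert the SFrame to a matrix once and reuse the exact same objects for
# fit() and score(), so both are guaranteed to see identical data.
features = training_sframe[feature_columns].to_dataframe().values
training_data_sparse_matrix = scipy.sparse.csr_matrix(features)
training_data_labels = np.array(training_sframe['label'])

decision_tree_model.fit(training_data_sparse_matrix, training_data_labels)
print(decision_tree_model.score(training_data_sparse_matrix, training_data_labels))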

EDIT#3:

The problem is resolved. I handled the training data in the wrong way, and that resulted in the models being fit on different data than they were scored on.

Here is the plot of accuracy after fixing the mistakes:

[Plot: decision tree accuracy vs. maximum depth]

This looks correct and would also explain why the assignment creators chose 6 as the maximum depth.

Upvotes: 3

Views: 2998

Answers (1)

Anthony E

Reputation: 11235

Shouldn't a tree with a higher maximum depth always have a higher accuracy when learned with the same training data?

No, definitely not always. The problem is that you're overfitting your model to your training data by fitting a more complex tree. Hence the lower score as you increase the maximum depth.
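One quick way to see the effect is to compare accuracy on the training set and on held-out data as max_depth grows. A small sketch with synthetic data (just for illustration, not the asker's dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only to illustrate how the gap between training and
# validation accuracy tends to widen as the tree gets deeper.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

for depth in (2, 6, 10, 20):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_valid, y_valid))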

Upvotes: 1
