Reputation: 6189
I have two fitted sklearn.tree.tree.DecisionTreeClassifier models. Both were trained on the same training data, but with different maximum depths: decision_tree_model used a max_depth of 6 and small_model used a max_depth of 2. Besides max_depth, no other parameters were specified.
When I compute the accuracy of both models on the training data like this:
small_model_accuracy = small_model.score(training_data_sparse_matrix, training_data_labels)
decision_tree_model_accuracy = decision_tree_model.score(training_data_sparse_matrix, training_data_labels)
Surprisingly the output is:
small_model accuracy: 0.61170212766
decision_tree_model accuracy: 0.422496238986
How is this even possible? Shouldn't a tree with a higher maximum depth always have a higher accuracy on the training data when learned with the same training data? Is it maybe that the score function outputs 1 - accuracy or something?
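One way to rule out the "1 - accuracy" idea is to compare score against an accuracy computed by hand. The following is a minimal sketch on synthetic data from make_classification (an assumption standing in for the real training data; random_state is also added here only for reproducibility). For a classifier, score returns the mean accuracy, so the two numbers should agree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, not the original data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Same data, different maximum depths, as in the question.
small_model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
decision_tree_model = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)

# score() on a classifier returns mean accuracy, not 1 - accuracy.
small_acc = small_model.score(X, y)
deep_acc = decision_tree_model.score(X, y)

# Accuracy computed independently from the predictions.
manual_small_acc = accuracy_score(y, small_model.predict(X))
manual_deep_acc = accuracy_score(y, decision_tree_model.predict(X))

print("small_model accuracy:", small_acc, "manual:", manual_small_acc)
print("decision_tree_model accuracy:", deep_acc, "manual:", manual_deep_acc)
```

If score and the manual value match, a deeper tree scoring lower on its own training data points to the data handling rather than to the metric.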
EDIT:
1 - accuracy or something like that.
EDIT#2:
It seems to be a mistake I made while working with the training data. I thought about the whole thing again and concluded: "Well, if the depth is higher, the tree shouldn't be the reason for this. What else is there? The training data itself. But I used the same data! Maybe I did something to the training data in between?"
Then I checked again and found a difference in how I used the training data. I need to transform it from an SFrame into a scipy matrix (it might have to be sparse too). I then ran another accuracy calculation right after fitting the two models. This one gives 61% accuracy for the small_model and 64% accuracy for the decision_tree_model. That's only 3 percentage points more and still somewhat surprising, but at least it's possible.
EDIT#3:
The problem is resolved. I handled the training data incorrectly, which led to the models being fitted differently.
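A rough sketch of the safer pattern: convert the data to a scipy sparse matrix once, then fit and score every model against that same object. The data here is synthetic (an assumption, since the SFrame conversion is not shown), and the reversed column order is only a hypothetical example of how a differently prepared matrix can distort the score, not the actual mistake.

```python
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic dense data standing in for the converted SFrame (assumption).
X_dense, training_data_labels = make_classification(
    n_samples=400, n_features=8, random_state=1
)

# Convert once and reuse the same matrix for fitting and scoring.
training_data_sparse_matrix = csr_matrix(X_dense)

models = {}
for name, depth in [("small_model", 2), ("decision_tree_model", 6)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(training_data_sparse_matrix, training_data_labels)
    models[name] = model

# Scores on exactly the matrix the models were fitted on.
consistent = {
    name: model.score(training_data_sparse_matrix, training_data_labels)
    for name, model in models.items()
}

# Hypothetical mishap: the same rows, but with the columns in a different order.
mismatched_matrix = csr_matrix(X_dense[:, ::-1])
mismatched = {
    name: model.score(mismatched_matrix, training_data_labels)
    for name, model in models.items()
}

for name in models:
    print(name, "consistent:", consistent[name], "mismatched:", mismatched[name])
```

Scoring right after fitting, on the same matrix object, makes it much harder for a preprocessing difference to slip in between the two steps.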
Here is the plot of accuracy after fixing the mistakes:
This looks correct and would also explain why the assignment creators chose 6 as the maximum depth.
Upvotes: 3
Views: 2998
Reputation: 11235
Shouldn't a tree with a higher maximum depth always have a higher accuracy when learned with the same training data?
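No, definitely not always. The problem is that by fitting a more complex tree you're overfitting your model to your training data. Hence the lower score as you increase the maximum depth, at least on data the model was not trained on. Overfitting by itself does not lower accuracy on the training data, so a lower training score for the deeper tree would point to something else, such as the data being prepared differently.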
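A minimal sketch of how to check for overfitting, assuming synthetic data and a simple train/validation split (neither comes from the question): overfitting typically shows up as validation accuracy flattening or dropping while training accuracy keeps rising with depth.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (assumption); flip_y mislabels a fraction of the samples.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Collect (depth, training accuracy, validation accuracy) for several depths.
results = []
for depth in range(1, 13):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    results.append((depth, train_acc, val_acc))
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  validation={val_acc:.3f}")
```

Comparing the two columns shows which depth is reasonable: the point where validation accuracy stops improving is usually a better choice than the deepest tree.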
Upvotes: 1