saraherceg

Reputation: 341

Decision tree - is it overfitting?

I am building a tree classifier and I would like to check for and fix possible overfitting. These are the calculations:

dtc = DecisionTreeClassifier(max_depth=3,min_samples_split=3,min_samples_leaf=1, random_state=0)
dtc_fit = dtc.fit(X_train, y_train)

score = dtc_fit.score(X_test, y_test) * 100
print("Accuracy using Decision Tree:", round(score, 1), "%")

Accuracy using Decision Tree: 92.2 %


scores = cross_val_score(dtc_fit, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.91 (+/- 0.10)

What are the possible values I could change to get a better result, or are these already fine?

Thank you for the help; I am a beginner, so I am unsure how to interpret these results.
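A common first check for overfitting is to compare training accuracy against held-out test accuracy; a large gap suggests the tree has memorised the training set. Here is a minimal, hedged sketch of that check: it assumes you have `X_test`/`y_test` available, and uses synthetic data only so the snippet runs on its own.

```python
# Sketch: compare train vs. test accuracy to spot overfitting.
# Assumes X_train, X_test, y_train, y_test already exist in your code;
# the synthetic data below only makes this example self-contained.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dtc = DecisionTreeClassifier(max_depth=3, min_samples_split=3,
                             min_samples_leaf=1, random_state=0)
dtc.fit(X_train, y_train)

train_acc = dtc.score(X_train, y_train)
test_acc = dtc.score(X_test, y_test)
print("Train accuracy: %.3f" % train_acc)
print("Test accuracy:  %.3f" % test_acc)
# A large gap (e.g. train near 1.00 while test is much lower) suggests
# overfitting; similar values suggest the tree generalises reasonably.
```

With `max_depth=3` the tree is quite constrained, so the two numbers will usually be close; removing the depth limit is an easy way to see the gap open up.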

Upvotes: 1

Views: 1813

Answers (1)

Gambit1614

Reputation: 8801

Not sure exactly whether it is overfitting or not, but you can give GridSearchCV a try, for the following reasons:

  • It will evaluate your model across multiple cross-validation splits, so you will get an idea of whether the decision tree is overfitting on your training set (although this alone is not necessarily a conclusive way of knowing).
  • You can tune various parameters by making a dictionary of parameter names and the values they can take, like this:

    from sklearn.model_selection import GridSearchCV
    
    parameters_dict = {"max_depth": [2, 5, 6, 10],
                       "min_samples_split": [0.1, 0.2, 0.3, 0.4],
                       "min_samples_leaf": [0.1, 0.2, 0.3, 0.4],
                       "criterion": ["gini", "entropy"]}
    
    dtc = DecisionTreeClassifier(random_state=0)
    
    grid_obj = GridSearchCV(estimator=dtc, param_grid=parameters_dict, cv=10)
    
    grid_obj.fit(X_train, y_train)
    
    # Extract the best classifier
    best_clf = grid_obj.best_estimator_
    
  • You can also try Recursive Feature Elimination with cross-validation (RFECV) to find the best features. (This is optional.)

  • You can check other metrics such as precision, recall, and F1-score to get an idea of whether your decision tree is overfitting the data (or giving importance to one class over the others).

  • Also, as a side note, make sure your data does not suffer from a class-imbalance problem.
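The last two points can be checked with a few lines. This sketch assumes a fitted classifier named `best_clf` (e.g. the one from the grid search) and held-out `X_test`/`y_test`; the synthetic, deliberately imbalanced data is only there to make it self-contained.

```python
# Sketch: per-class metrics and a quick class-balance check.
# `best_clf`, X_test and y_test are assumed to exist in your code;
# the synthetic data below only makes this snippet runnable.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Class balance: a heavily skewed count inflates plain accuracy.
print("Class counts:", np.bincount(y_train))

best_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Precision, recall and F1 per class reveal whether one class dominates.
print(classification_report(y_test, best_clf.predict(X_test)))
```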

This is not an exhaustive list, and these are not necessarily the best ways to check for overfitting, but you can give them a try.
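For the feature-elimination suggestion, a minimal RFECV sketch looks like this (synthetic data again, just to keep it runnable; with your own data you would pass `X_train`/`y_train`):

```python
# Sketch: Recursive Feature Elimination with cross-validation (RFECV).
# It repeatedly drops the least important feature and keeps the subset
# with the best cross-validated score.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

selector = RFECV(DecisionTreeClassifier(random_state=0), cv=5)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Selected-feature mask:", selector.support_)
```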

Upvotes: 1
