Reputation: 341
I am building a tree classifier and I would like to check for and fix possible overfitting. These are the calculations:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

dtc = DecisionTreeClassifier(max_depth=3, min_samples_split=3, min_samples_leaf=1, random_state=0)
dtc_fit = dtc.fit(X_train, y_train)

# accuracy on a held-out test split, as a percentage
score = dtc_fit.score(X_test, y_test) * 100
print("Accuracy using Decision Tree:", round(score, 1), "%")
Accuracy using Decision Tree: 92.2 %
scores = cross_val_score(dtc_fit, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.91 (+/- 0.10)
Which parameter values could I adjust to get a better result, or are these already fine?
Thank you for the help; I am a beginner and therefore unsure about the outcome.
Upvotes: 1
Views: 1813
Reputation: 8801
I'm not sure whether it is overfitting or not, but you can give GridSearchCV a try, for the following reasons.
You can try various parameter combinations by making a dictionary that maps each parameter name to the candidate values it can take, like this:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search is deprecated

parameters_dict = {
    "max_depth": [2, 5, 6, 10],
    "min_samples_split": [0.1, 0.2, 0.3, 0.4],
    "min_samples_leaf": [0.1, 0.2, 0.3, 0.4],
    "criterion": ["gini", "entropy"],
}

dtc = DecisionTreeClassifier(random_state=0)
grid_obj = GridSearchCV(estimator=dtc, param_grid=parameters_dict, cv=10)
grid_obj.fit(X_train, y_train)

# Extract the best classifier
best_clf = grid_obj.best_estimator_
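Once the search finishes, you can also inspect which parameter combination won and what its mean cross-validated score was:

# Best parameter combination and its mean CV score
print(grid_obj.best_params_)
print(grid_obj.best_score_)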
Also, you can try Recursive Feature Elimination with cross-validation (RFECV) to find the best features, as in the sketch below. (This is an optional thing to do, btw.)
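A minimal sketch of what that could look like, assuming the same X_train/y_train as above; RFECV and DecisionTreeClassifier are real scikit-learn classes, the variable names are just for illustration:

from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=0)
# step=1 removes one feature per iteration; cv=5 scores each feature subset
rfecv = RFECV(estimator=dtc, step=1, cv=5)
rfecv.fit(X_train, y_train)

print("Optimal number of features:", rfecv.n_features_)
print("Selected feature mask:", rfecv.support_)
# Keep only the selected columns
X_train_reduced = rfecv.transform(X_train)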
You can check other metrics like precision, recall, f1-score, etc. to get an idea of whether your decision tree is overfitting the data (or is giving importance to one class over the others).
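For example, a minimal sketch assuming you also have a held-out X_test/y_test split (not shown in your question):

from sklearn.metrics import classification_report, confusion_matrix

y_pred = best_clf.predict(X_test)
# Per-class precision, recall and f1-score
print(classification_report(y_test, y_pred))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))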
Also, as a side note, just make sure that your data does not suffer from a class imbalance problem.
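A quick way to check, assuming y_train is an array of class labels; if one class dominates, DecisionTreeClassifier's class_weight="balanced" option reweights the splits inversely to class frequency:

from collections import Counter

# Count how many samples each class has in the training labels
print(Counter(y_train))

# If the counts are badly skewed, reweight the classes
dtc = DecisionTreeClassifier(random_state=0, class_weight="balanced")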
This is not an exhaustive list, and these are not necessarily the best ways to check for overfitting, but you can give them a try.
Upvotes: 1