Mauro Nogueira

Reputation: 131

Print decision tree and feature_importance when using BaggingClassifier

Obtaining the decision tree and the important features is easy when using DecisionTreeClassifier in scikit-learn. However, I am not able to obtain either of them when I add a bagging function, e.g., BaggingClassifier.

Since the model has to be fitted through the BaggingClassifier, I cannot get at the results (the tree plots, feature_importances_, ...) of the underlying DecisionTreeClassifier.

Here is my script:

import numpy as np
import pydotplus
from IPython.display import Image
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV

# kfold, X_train, Y_train, and target_names are defined earlier in the script.
seed = 7
n_iterations = 199
DTC = DecisionTreeClassifier(random_state=seed,
                             max_depth=None,
                             min_impurity_split=0.2,
                             min_samples_leaf=6,
                             max_features=None,  # if None, then max_features=n_features
                             max_leaf_nodes=20,
                             criterion='gini',
                             splitter='best')

#parametersDTC = {'max_depth':range(3,10), 'max_leaf_nodes':range(10, 30)}
parameters = {'max_features':range(1,200)}
dt = RandomizedSearchCV(BaggingClassifier(base_estimator=DTC,
                                          #max_samples=1,
                                          n_estimators=100,
                                          #max_features=1,
                                          bootstrap=False,
                                          bootstrap_features=True,
                                          random_state=seed),
                        parameters, n_iter=n_iterations, n_jobs=14, cv=kfold,
                        error_score='raise', random_state=seed, refit=True) #min_samples_leaf=10

# Fit the model

fit_dt = dt.fit(X_train, Y_train)
print(dir(fit_dt))
tree_model = dt.best_estimator_

# Print the important features (NOT WORKING)

features = tree_model.feature_importances_
print(features)

rank = np.argsort(features)[::-1]
print(rank[:12])
print(sorted(list(zip(features))))

# Importing the image (NOT WORKING)
from sklearn.externals.six import StringIO

tree.export_graphviz(dt.best_estimator_, out_file='tree.dot') # necessary to plot the graph

dot_data = StringIO() # in-memory buffer that will receive the dot-format tree description
tree.export_graphviz(dt.best_estimator_, out_file=dot_data, filled=True,
                     class_names=target_names, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

img = Image(graph.create_png())
print(dir(img)) # dir lists the attributes available on the Image object returned from graph.create_png

with open("my_tree.png", "wb") as png:
    png.write(img.data)

I obtain errors like: 'BaggingClassifier' object has no attribute 'tree_' and 'BaggingClassifier' object has no attribute 'feature_importances'. Does anyone know how I can obtain them? Thanks.

Upvotes: 2

Views: 3248

Answers (1)

Miriam Farber

Reputation: 19634

Based on the documentation, the BaggingClassifier object indeed doesn't have the attribute feature_importances_. You can still compute it yourself, as described in the answer to this question: Feature importances - Bagging, scikit-learn
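
As a minimal sketch of that approach (this is not code from the linked answer): average the per-tree importances over estimators_, using estimators_features_ to map each tree's columns back to the original feature indices; that mapping matters here because the question sets bootstrap_features=True.

import numpy as np
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier

iris = datasets.load_iris()
clf = BaggingClassifier(n_estimators=10, bootstrap_features=True, random_state=0)
clf.fit(iris.data, iris.target)

# Accumulate each tree's importances at the original feature positions;
# np.add.at handles the duplicate indices that feature bootstrapping produces.
importances = np.zeros(iris.data.shape[1])
for est, feat_idx in zip(clf.estimators_, clf.estimators_features_):
    np.add.at(importances, feat_idx, est.feature_importances_)
importances /= len(clf.estimators_)
print(importances)  # one averaged importance per original feature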

You can access the trees that were produced during the fitting of BaggingClassifier using the attribute estimators_, as in the following example:

from sklearn import datasets
from sklearn.ensemble import BaggingClassifier

iris = datasets.load_iris()
clf = BaggingClassifier(n_estimators=3)
clf.fit(iris.data, iris.target)
clf.estimators_

clf.estimators_ is a list of the 3 fitted decision trees:

[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_split=1e-07, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=1422640898, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_split=1e-07, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=1968165419, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_split=1e-07, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=2103976874, splitter='best')]

So you can iterate over the list and access each one of the trees.
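
For the plotting part of the question, the same idea applies: export each individual tree, since export_graphviz accepts a fitted DecisionTreeClassifier but not the ensemble. A minimal sketch reusing clf from above (the file names are just illustrative):

from sklearn.tree import export_graphviz

# Write one .dot file per fitted tree; each file can then be rendered with
# pydotplus exactly as in the question's single-tree workflow.
for i, est in enumerate(clf.estimators_):
    export_graphviz(est, out_file='tree_{}.dot'.format(i),
                    filled=True, rounded=True, special_characters=True)

In the question's setup the fitted ensemble is dt.best_estimator_, so the same loop runs over dt.best_estimator_.estimators_.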

Upvotes: 2
