Adding correct labels to decision trees

Question

I am using a random forest regressor in a machine learning project. To better understand the logic behind the predictions, I'd like to visualize some of its decision trees and check which features are used where.

In order to do so, I wrote the following code:

from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image

# Select one estimator from the Random Forests
estimator = best_estimators_regr['RandomForestRegressor'][0].estimators_[0]

export_graphviz(estimator, out_file=path+'tree.dot',
                rounded=True, proportion=False,
                precision=2, filled=True)
call(['dot', '-Tpng', path+'tree.dot', '-o', path+'tree.png', '-Gdpi=600'])
Image(filename=path+'tree.png')

The problem is that I set the max_features parameter when training the model, so I do not know which features are used in each tree. As a result, when I plot a tree, the nodes are labeled only as X[some_number]. Does this number correspond to the column index in the original dataset? If not, how can I make the plot show the column names instead of the numbers?

Vivek Kumar · Accepted Answer

The max_features parameter of RandomForestRegressor (and likewise of RandomForestClassifier) sets how many features are considered at a time when searching for the best split. That parameter is passed on to each individual estimator (a DecisionTreeRegressor in your case). Every base tree still receives the whole feature matrix: the rows are sampled from the training data, but all feature columns are passed to each tree. The random subset of candidate features is drawn inside each tree, at each split, so the feature indices keep the original column order. In other words, X[i] in the plot refers to column i of the data you trained on, so there is no need to worry about that.
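To make the point above concrete, here is a minimal sketch on made-up data (the random matrix, the six columns, and the variable names are assumptions for illustration, not taken from the question). It checks that every tree in the forest was fitted on all columns even though max_features is smaller, and lists which column indices the first tree actually split on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: 300 rows, 6 feature columns, all of them informative.
rng = np.random.RandomState(0)
X = rng.rand(300, 6)
y = X @ np.arange(1, 7) + 0.05 * rng.rand(300)

# max_features=2 limits the candidates examined at each split only.
forest = RandomForestRegressor(n_estimators=10, max_features=2, random_state=0)
forest.fit(X, y)

# Number of input columns each fitted tree knows about.
widths = {tree.tree_.n_features for tree in forest.estimators_}

# tree_.feature stores the column index used at each node (negative for leaves).
first_tree = forest.estimators_[0]
first_tree_columns = sorted({int(i) for i in first_tree.tree_.feature if i >= 0})

print(widths)
print(first_tree_columns)
```

The indices printed for the first tree are positions in the original X, which is why the X[some_number] labels in the plot can be read as column numbers of the training data.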

You can simply pass the column names through the feature_names parameter of export_graphviz, listed in the same order as the columns of your training data.
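A minimal sketch of that call, assuming a small synthetic dataset with made-up column names (in your case the list would typically come from the columns of the DataFrame you trained on). It uses out_file=None so that export_graphviz returns the DOT source as a string, which makes it easy to check the labels before rendering with dot.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Hypothetical column names and data, for illustration only.
feature_names = ["age", "income", "height", "weight"]
rng = np.random.RandomState(0)
X = rng.rand(200, len(feature_names))
y = 3 * X[:, 1] + X[:, 3] + 0.1 * rng.rand(200)

forest = RandomForestRegressor(
    n_estimators=5, max_features=2, max_depth=3, random_state=0
)
forest.fit(X, y)

# Select one estimator from the forest, as in the question.
estimator = forest.estimators_[0]

# feature_names must follow the column order of X used for fitting.
dot_source = export_graphviz(
    estimator,
    out_file=None,  # return the DOT text instead of writing a file
    feature_names=feature_names,
    rounded=True,
    proportion=False,
    precision=2,
    filled=True,
)

# Split nodes are now labeled with names such as "income <= ..." instead of X[1].
print(dot_source[:300])
```

To keep the original workflow, pass out_file=path+'tree.dot' instead of None and run the same dot command as before.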
