Reputation: 376
I am using a random forest regressor in a machine learning project. To better understand the logic behind the predictions, I'd like to visualize some of the decision trees and check which features are used where.
In order to do so, I wrote the following code:
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image

# Select one estimator from the Random Forests
estimator = best_estimators_regr['RandomForestRegressor'][0].estimators_[0]

export_graphviz(estimator, out_file=path+'tree.dot',
                rounded=True, proportion=False,
                precision=2, filled=True)

call(['dot', '-Tpng', path+'tree.dot', '-o', path+'tree.png', '-Gdpi=600'])
Image(filename=path+'tree.png')
The problem is that I use the max_features parameter when training the model, so I do not know which features are used in each tree. Thus, when plotting a tree, I simply get X[some_number]. Does this number correspond to the column in the original dataset? If not, how can I tell it to use the names of the columns rather than the numbers?
Upvotes: 0
Views: 569
Reputation: 36599
The max_features parameter of RandomForestRegressor (the same holds for RandomForestClassifier) sets how many features are considered at a time when searching for the best split. That parameter is passed down to all the individual estimators (DecisionTreeRegressor). Each base DecisionTreeRegressor is fit on the whole set of columns: the rows are resampled from the training data, but all feature columns are passed to each tree, and the random subset of max_features candidates is drawn inside the tree at each split. The number in X[some_number] is therefore the column position in the data the forest was trained on, so there is no need to worry about that.
You can just use the feature_names parameter of export_graphviz to pass the names of all your features, in the same order as the columns of the training data.
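As a minimal sketch of this, the snippet below trains a small forest on synthetic data and exports one tree with named features. The data, the column names in `feature_names`, and the variable `forest` are made up for illustration; in the question's setup the names would come from the training data's columns. It assumes scikit-learn and NumPy are installed and does not call Graphviz, since `out_file=None` makes `export_graphviz` return the DOT source as a string.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Synthetic data with hypothetical column names (illustration only).
feature_names = ["age", "income", "height", "weight"]
rng = np.random.RandomState(0)
X = rng.rand(200, len(feature_names))
y = 3 * X[:, 1] + X[:, 0] + 0.1 * rng.rand(200)

# max_features=2: two candidate features are drawn at each split,
# but every tree is still fit on all four columns.
forest = RandomForestRegressor(n_estimators=5, max_features=2, random_state=0)
forest.fit(X, y)
estimator = forest.estimators_[0]

# With out_file=None the DOT source is returned as a string.
dot_source = export_graphviz(estimator, out_file=None,
                             feature_names=feature_names,
                             rounded=True, proportion=False,
                             precision=2, filled=True)

# tree_.feature holds the column index tested at each node;
# leaf nodes carry a negative placeholder value.
used = sorted({feature_names[i] for i in estimator.tree_.feature if i >= 0})
print("Features used in this tree:", used)
```

The node labels in `dot_source` now show the names instead of X[some_number], and the indices in `estimator.tree_.feature` can be mapped back to columns the same way.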
Upvotes: 1