Reputation: 33
I'm seeing the same feature and threshold (CO2) appear twice when examining a decision tree from a random forest model. The code to visualize the tree is the following:
estimator = model.estimators_[10]
from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot',
                feature_names=['pdo', 'pna', 'lat', 'lon', 'ele', 'co2'],
                class_names='disWY',
                rounded=False, proportion=False,
                precision=3, filled=True)
# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=300'])
# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
It is clear that CO2 and -0.69 are used twice. I don't understand how this is possible. Does anyone have any idea?
Shouldn't the threshold be different each time the same feature is used?
Upvotes: 3
Views: 133
Reputation: 4273
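This is probably a rounding issue in the display: the two thresholds are likely different, but they look identical once rounded to precision=3.
It's a little contrived, but here's a minimal way to reproduce this with RandomForestRegressor: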
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz
X = np.array([[-0.6901, 4.123],
              [-0.6902, 5.456],
              [-0.6903, 6.789],
              [-0.6904, 7.012]])
y = np.array([0.0, 1.0, 1.0, 0.0])
reg = RandomForestRegressor(random_state=42).fit(X, y)
export_graphviz(reg.estimators_[6], out_file="tree6.dot", precision=3, filled=True)
# dot -Tpng tree6.dot -o tree6.png
If we instead passed a higher precision, such as precision=8, when calling export_graphviz(), the thresholds would be printed with enough digits to tell them apart.
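The same check can be done without rendering an image, by reading the split thresholds from the fitted tree's tree_ attribute. This is only a minimal sketch, not the original poster's data: it assumes a toy one-feature dataset, and it uses a single DecisionTreeRegressor as a stand-in for one entry of model.estimators_ (each entry of a RandomForestRegressor's estimators_ is a DecisionTreeRegressor, so the same attributes should be available there).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (an assumption for illustration): one feature whose values
# differ only in the fourth decimal place.
X = np.array([[-0.6901], [-0.6902], [-0.6903], [-0.6904]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# A single tree stands in for one entry of model.estimators_.
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# tree_.feature is negative for leaf nodes; keep internal (split) nodes only.
is_split = tree.tree_.feature >= 0
thresholds = tree.tree_.threshold[is_split]

# Format the same thresholds at two different precisions.
shown_at_3 = [f"{t:.3f}" for t in thresholds]
shown_at_8 = [f"{t:.8f}" for t in thresholds]

print("precision=3:", shown_at_3)
print("precision=8:", shown_at_8)
```

If the values printed at three decimals match while the values at eight decimals differ, the "duplicate" split in the picture is just two distinct thresholds that round to the same number. For the forest in the question, the same inspection could be run on model.estimators_[10].tree_.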
Upvotes: 2