Duplicated feature and criteria from sklearn RandomForest when examining the decision path

Question

I'm getting a duplicated feature and threshold (CO2) when examining a decision tree from a random forest model. The code I use to visualize the tree is the following:

estimator = model.estimators_[10]
from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = ['pdo', 'pna', 'lat', 'lon', 'ele', 'co2'],
                class_names = 'disWY',
                rounded = False, proportion = False, 
                precision = 3, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=300'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

It is clear that CO2 and -0.69 are used twice. I don't understand how this is possible. Does anyone have any idea?

[Screenshot of the decision tree]

Shouldn't the same feature get a different threshold each time it is used?

Alexander L. Hayes · Accepted Answer

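This is probably a rounding effect in the plot: the two thresholds are likely different, but precision=3 rounds both of them to -0.69 when the tree is exported.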

It's a little contrived, but here's a minimal way to reproduce this with RandomForestRegressor:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

X = np.array([[-0.6901, 4.123],
              [-0.6902, 5.456],
              [-0.6903, 6.789],
              [-0.6904, 7.012]])
y = np.array([0.0, 1.0, 1.0, 0.0])

reg = RandomForestRegressor(random_state=42).fit(X, y)

export_graphviz(reg.estimators_[6], out_file="tree6.dot", precision=3, filled=True)
# dot -Tpng tree6.dot -o tree6.png

If we instead passed a higher precision, such as precision=8, when calling export_graphviz(), the thresholds would be shown with enough digits to tell them apart.
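One way to check this without re-rendering the plot is to read the unrounded split values from the fitted tree's tree_ attribute. The sketch below reuses the toy data from the reproduction; split_thresholds is a hypothetical helper name chosen for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def split_thresholds(estimator):
    """Return (feature_index, threshold) for every internal node of a fitted tree.

    Hypothetical helper for illustration; it only reads the public tree_ arrays.
    """
    tree = estimator.tree_
    # Leaf nodes have children_left == -1, so this mask keeps only split nodes.
    internal = tree.children_left != -1
    return [
        (int(f), float(t))
        for f, t in zip(tree.feature[internal], tree.threshold[internal])
    ]


# Same toy data as in the reproduction: the first column differs only
# in the fourth decimal place.
X = np.array([[-0.6901, 4.123],
              [-0.6902, 5.456],
              [-0.6903, 6.789],
              [-0.6904, 7.012]])
y = np.array([0.0, 1.0, 1.0, 0.0])

reg = RandomForestRegressor(random_state=42).fit(X, y)

# Print each split as the plot would show it (3 decimals) next to its full value.
for feature, threshold in split_thresholds(reg.estimators_[6]):
    print(f"feature {feature}: shown as {threshold:.3f}, stored as {threshold!r}")
```

For the model in the question, the same helper could be called on model.estimators_[10] to see whether the two co2 splits really share a threshold or only look identical after rounding.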
