Reputation: 300
A decision tree splits nodes until some stopping condition is met and uses the mean of the target values in each leaf as its prediction.
I would like to get all the values in such a node, not just the mean, in order to perform more complex operations on them. I am using sklearn. I did not find any answer on this, only a way to get the mean of every node via DecisionTreeRegressor.tree_.value.
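To illustrate, here is a minimal sketch of what I mean (toy data, just for illustration):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# toy data
X = np.arange(10).reshape(-1, 1)
y = np.arange(10, dtype=float)
reg = DecisionTreeRegressor(max_depth=2).fit(X, y)
# tree_.value stores each node's mean target (shape (n_nodes, n_outputs, 1)),
# not the individual samples that fell into the node
print(reg.tree_.value[:, 0, 0])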
How can I do this?
Upvotes: 0
Views: 2035
Reputation: 300
Thanks to @desertnaut we have a really good answer. For people who want a pandas-based solution, I suggest the following code:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import pandas as pd
## Dummy data code provided by desertnaut
rng = np.random.RandomState(1) # for reproducibility
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))
## Assuming X and y to be pd.DataFrame
X, y = pd.DataFrame(X, columns=['input']), pd.DataFrame(y, columns=['output'])
## Train a regression tree
estimator = DecisionTreeRegressor(max_depth=3)
estimator.fit(X, y)
leaf_index = pd.DataFrame(estimator.apply(X), columns=['leaf_index'], index=y.index)
leaf_df = pd.concat([leaf_index, y], axis=1).groupby('leaf_index')\
    .apply(lambda g: g['output'].to_numpy())\
    .to_frame('leaf_values').reset_index()  # .unique() would silently drop duplicate targets
leaf_df['leaf_size'] = leaf_df.leaf_values.apply(len)
Jupyter shows the resulting dataframe (columns leaf_index, leaf_values and leaf_size); as you can see, we get the same results as those of desertnaut.
After that, it is pretty simple to get the leaf samples corresponding to a given observation x:
leaf_df.loc[leaf_df.leaf_index == estimator.apply(x)[0], 'leaf_values']
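For instance, taking the first training observation as x (kept two-dimensional so that apply accepts it), a quick sketch:
x = X.iloc[[0]]  # a single observation, still a DataFrame
leaf = estimator.apply(x)[0]  # leaf index for x
leaf_df.loc[leaf_df.leaf_index == leaf, 'leaf_values'].iloc[0]  # all target values in x's leaf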
Upvotes: 1
Reputation: 60321
AFAIK there is no API method for this, but you can certainly get the values programmatically.
Let's make some dummy data and build a regression tree first to demonstrate this:
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# dummy data
rng = np.random.RandomState(1) # for reproducibility
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))
estimator = DecisionTreeRegressor(max_depth=3)
estimator.fit(X, y)
import graphviz
dot_data = export_graphviz(estimator, out_file=None)
graph = graphviz.Source(dot_data)
graph
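If graphviz is not installed, scikit-learn's built-in plot_tree gives a similar picture; a minimal sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(estimator, ax=ax)  # same tree structure, rendered with matplotlib
plt.show()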
Here is a plot of our decision tree, from which it is apparent that we have 8 leaves, with the number of samples and the mean value of each one depicted.
The key method here is apply:
on_leaf = estimator.apply(X)
on_leaf
# result:
array([ 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 13, 13, 13,
13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14])
on_leaf has a length equal to our data X and outcomes y; it gives the indices of the nodes where each sample ends up (all nodes in on_leaf being terminal nodes, i.e. leaves). The number of its unique values is equal to the number of our leaves, here 8:
len(np.unique(on_leaf))
# 8
and on_leaf[k] gives the index of the node where y[k] ends up.
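For instance, reading off the output above:
on_leaf[0]   # 3  -> y[0] ended up in node #3
on_leaf[-1]  # 14 -> y[79] ended up in node #14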
Now we can get the actual y values for each one of the 8 leaves:
leaves = []
for i in np.unique(on_leaf):
    leaves.append(y[np.argwhere(on_leaf==i)])
len(leaves)
# 8
Let's verify that, in accordance with our plot, the first leaf has only one sample with a value of -1.149 (since it is a single-sample leaf, the value of the sample is equal to the mean):
leaves[0]
# array([[-1.1493464]])
Looks good. What about the 2nd leaf, with 10 samples and a mean value of 0.173?
leaves[1]
# result:
array([[ 0.09131401],
[ 0.09668352],
[ 0.13651039],
[ 0.19403525],
[-0.12383814],
[ 0.26365828],
[ 0.41252216],
[ 0.44546446],
[ 0.47215529],
[-0.26319138]])
len(leaves[1])
# 10
leaves[1].mean()
# 0.17253138570808904
And so on - a final check for the last leaf (#7), with 4 samples and a mean of -0.99:
leaves[7]
# result:
array([[-0.99994398],
[-0.99703245],
[-0.99170146],
[-0.9732277 ]])
leaves[7].mean()
# -0.9904763973694366
So, in summary: what you need, given data X, outcomes y, and a fitted decision tree regressor estimator, is:
on_leaf = estimator.apply(X)
leaves = []
for i in np.unique(on_leaf):
    leaves.append(y[np.argwhere(on_leaf==i)])
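Equivalently, if you prefer the leaves keyed by their node index, a dict comprehension does the same job (a sketch):
leaf_values = {i: y[on_leaf == i] for i in np.unique(on_leaf)}  # 1-D arrays per leaf, unlike the argwhere version above, which keeps a 2-D shape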
Upvotes: 1