kakarotto

Reputation: 300

Get all values of a terminal (leaf) node in a DecisionTreeRegressor

A decision tree splits nodes until some stopping condition is met, and uses the mean of the values in a leaf node as its prediction.

I would like to get all the values in such a node, not just their mean, so that I can then perform more complex operations on them. I am using sklearn. I did not find any answers on this, only a way to get the mean of each node via DecisionTreeRegressor.tree_.value.
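
For reference, tree_.value only stores the per-node means (it has shape (n_nodes, 1, 1) for a single-output regressor); a minimal sketch with toy data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(10).reshape(-1, 1)                          # toy inputs 0..9
y = np.array([0, 0, 0, 0, 0, 5, 5, 5, 5, 5], dtype=float)

reg = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(reg.tree_.value.squeeze())   # [2.5 0. 5.] - the per-node means, nothing more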

How can I do this?

Upvotes: 0

Views: 2035

Answers (2)

kakarotto

Reputation: 300

Thanks to @desertnaut we have a really good answer. For people who want a pandas-based solution, I suggest the following code:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

## Dummy data code provided by desertnaut
rng = np.random.RandomState(1)  # for reproducibility
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

## Convert X and y to pd.DataFrame (the solution below assumes dataframes)
X, y = pd.DataFrame(X, columns=['input']), pd.DataFrame(y, columns=['output'])

## Train a regression tree
estimator = DecisionTreeRegressor(max_depth=3)
estimator.fit(X, y)

leaf_index = pd.DataFrame(estimator.apply(X), columns=['leaf_index'], index=y.index)

## Collect all y values per leaf; to_numpy() (rather than unique())
## keeps duplicate y values within a leaf
leaf_df = pd.concat([leaf_index, y], axis=1).groupby('leaf_index')\
                                            .apply(lambda g: g['output'].to_numpy())\
                                            .to_frame('leaf_values').reset_index()
leaf_df['leaf_size'] = leaf_df.leaf_values.apply(len)

Jupyter shows the following dataframe; as you can see, we get the same results as desertnaut.

[image: the resulting leaf_df dataframe]
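
In text form, the dataframe should look roughly like this (leaf values abbreviated; the leaf indices and sizes match desertnaut's answer below):

   leaf_index                     leaf_values  leaf_size
0           3                    [-1.1493464]          1
1           4   [0.09131401, 0.09668352, ...]         10
2           6                           [...]         24
3           7                           [...]         16
4          10                           [...]          4
5          11                           [...]         10
6          13                           [...]         11
7          14  [-0.99994398, ..., -0.9732277]          4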

After that, it's pretty simple to get the leaf samples corresponding to a given observation x; note that apply expects a 2-D input and returns an array, hence the [0]:

leaf_df.loc[leaf_df['leaf_index'] == estimator.apply(x)[0], 'leaf_values']
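
For instance, with a hypothetical single-row observation:

x = pd.DataFrame([[2.5]], columns=['input'])  # hypothetical new data point
estimator.apply(x)[0]                         # the index of the leaf this point falls into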

Upvotes: 1

desertnaut

Reputation: 60321

AFAIK there is no API method for this, but you can certainly get them programmatically.

Let's make some dummy data and build a regression tree first to demonstrate this:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# dummy data
rng = np.random.RandomState(1)  # for reproducibility
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

estimator = DecisionTreeRegressor(max_depth=3)
estimator.fit(X, y)

import graphviz
dot_data = export_graphviz(estimator, out_file=None)

# render the exported dot source (displays inline in a Jupyter notebook)
graph = graphviz.Source(dot_data)
graph

Here is a plot of our decision tree:

[image: graphviz plot of the fitted decision tree]

from which it is apparent that we have 8 leaves, with the number of samples and the mean of each one depicted.

The key command here is apply:

on_leaf = estimator.apply(X)
on_leaf
# result:
array([ 3,  4,  4,  4,  4,  4,  4,  4,  4,  4,  4,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
       10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 13, 13, 13,
       13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14])

on_leaf has a length equal to our data X and outcomes y; it gives the indices of the nodes where each sample ends up (all nodes in on_leaf being terminal nodes, i.e. leaves). The number of its unique values is equal to the number of our leaves, here 8:

len(np.unique(on_leaf))
# 8
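
The per-leaf sample counts can be read off the same way; as a quick check against the counts shown in the plot:

np.unique(on_leaf, return_counts=True)
# (array([ 3,  4,  6,  7, 10, 11, 13, 14]), array([ 1, 10, 24, 16,  4, 10, 11,  4]))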

Additionally, on_leaf[k] gives the index of the node where y[k] ends up; for instance, on_leaf[0] is 3, i.e. y[0] ends up in node 3 (the single-sample leaf, as we verify below).

Now we can get the actual y values for each one of the 8 leaves as:

leaves = []
for i in np.unique(on_leaf):
  leaves.append(y[np.argwhere(on_leaf==i)]) 

len(leaves)
# 8

Let's verify that, in accordance with our plot, the first leaf has only one sample with the value of -1.149 (since it is a single-sample leaf, the value of the sample is equal to the mean):

leaves[0]
# array([[-1.1493464]])

Looks good. What about the 2nd leaf, with 10 samples and a mean value of 0.173?

leaves[1]
# result:
array([[ 0.09131401],
       [ 0.09668352],
       [ 0.13651039],
       [ 0.19403525],
       [-0.12383814],
       [ 0.26365828],
       [ 0.41252216],
       [ 0.44546446],
       [ 0.47215529],
       [-0.26319138]])

len(leaves[1])
# 10

leaves[1].mean()
# 0.17253138570808904

And so on - a final check for the last leaf (#7), with 4 samples and a mean of -0.99:

leaves[7]
# result:
array([[-0.99994398],
       [-0.99703245],
       [-0.99170146],
       [-0.9732277 ]])

leaves[7].mean()
# -0.9904763973694366
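
Rather than spot-checking leaf by leaf, we can also verify every leaf at once against the means stored by the tree itself (tree_.value has shape (n_nodes, 1, 1) here, with the mean of node i at [i, 0, 0]):

for i in np.unique(on_leaf):
    assert np.isclose(y[on_leaf == i].mean(), estimator.tree_.value[i, 0, 0])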

To summarize:

What you need with data X, outcomes y, and a decision tree regressor estimator is:

on_leaf = estimator.apply(X)

leaves = []
for i in np.unique(on_leaf):
  leaves.append(y[np.argwhere(on_leaf==i)]) 
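
If you prefer the leaf values keyed by their node index rather than by position, a small variation (boolean indexing also avoids the extra dimension that argwhere introduces):

leaf_values = {i: y[on_leaf == i] for i in np.unique(on_leaf)}
leaf_values[3]
# array([-1.1493464])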

Upvotes: 1
