Reputation: 117
I am using rpart to build a classification tree, and I want to write my own pruning function based on criteria evaluated at the leaf nodes. For instance, if a leaf node fails some criterion (in my case, the stability of the parameter estimates), I want to climb the tree and get the parent of that leaf node, even though the parent is not terminal. To do this, I want to traverse the tree by path, so I need each leaf node together with its parent nodes in order to climb the tree when necessary.
Let's look at this example:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
> fit
n= 81
node), split, n, loss, yval, (yprob)
      * denotes terminal node
 1) root 81 17 absent (0.79012346 0.20987654)
   2) Start>=8.5 62 6 absent (0.90322581 0.09677419)
     4) Start>=14.5 29 0 absent (1.00000000 0.00000000) *
     5) Start< 14.5 33 6 absent (0.81818182 0.18181818)
      10) Age< 55 12 0 absent (1.00000000 0.00000000) *
      11) Age>=55 21 6 absent (0.71428571 0.28571429)
        22) Age>=111 14 2 absent (0.85714286 0.14285714) *
        23) Age< 111 7 3 present (0.42857143 0.57142857) *
   3) Start< 8.5 19 8 present (0.42105263 0.57894737) *
With fit$frame I can get the information about the leaf nodes:
fit$frame
      var  n wt dev yval complexity ncompete nsurrogate   yval2.V1    yval2.V2    yval2.V3   yval2.V4   yval2.V5
1   Start 81 81  17    1 0.17647059        2          1 1.00000000 64.00000000 17.00000000 0.79012346 0.20987654
2   Start 62 62   6    1 0.01960784        2          2 1.00000000 56.00000000  6.00000000 0.90322581 0.09677419
4  <leaf> 29 29   0    1 0.01000000        0          0 1.00000000 29.00000000  0.00000000 1.00000000 0.00000000
5     Age 33 33   6    1 0.01960784        2          2 1.00000000 27.00000000  6.00000000 0.81818182 0.18181818
10 <leaf> 12 12   0    1 0.01000000        0          0 1.00000000 12.00000000  0.00000000 1.00000000 0.00000000
11    Age 21 21   6    1 0.01960784        2          0 1.00000000 15.00000000  6.00000000 0.71428571 0.28571429
22 <leaf> 14 14   2    1 0.01000000        0          0 1.00000000 12.00000000  2.00000000 0.85714286 0.14285714
23 <leaf>  7  7   3    2 0.01000000        0          0 2.00000000  3.00000000  4.00000000 0.42857143 0.57142857
3  <leaf> 19 19   8    2 0.01000000        0          0 2.00000000  8.00000000 11.00000000 0.42105263 0.57894737
I can find which leaf node each row of the data falls into using fit$where.
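One detail that is easy to miss: fit$where stores row positions in fit$frame rather than node numbers, so a lookup through the row names is needed to get the node numbers themselves. A minimal sketch of that conversion, assuming the default kyphosis fit shown above (the names node_ids and obs_node are illustrative only, not part of rpart):

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Row names of fit$frame are the node numbers printed by print(fit)
node_ids <- as.numeric(rownames(fit$frame))

# fit$where indexes rows of fit$frame, so map it through node_ids
obs_node <- node_ids[fit$where]

# Number of observations per leaf, labelled by node number
table(obs_node)
```

The counts in the table should match the n column of the leaf rows in fit$frame.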
Now I also want to get the parents of a leaf node. I know that path.rpart gives me all the splits made to reach a leaf node. For example, for leaf node 23:
> path.rpart(fit, 23)
node number: 23
root
Start>=8.5
Start< 14.5
Age>=55
Age< 111
What I want is the same path expressed as the node numbers of the parents of node 23. How can I make this association?
Thank you in advance.
Upvotes: 3
Views: 1563
Reputation: 20811
You don't need any information from the fitted tree, because rpart numbers its nodes according to a fixed pattern. Let's fit a more interesting tree:
(fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              cp = .0001, minsplit = 5))
# n= 81
#
# node), split, n, loss, yval, (yprob)
#       * denotes terminal node
#
#   1) root 81 17 absent (0.79012346 0.20987654)
#     2) Start>=8.5 62 6 absent (0.90322581 0.09677419)
#       4) Start>=14.5 29 0 absent (1.00000000 0.00000000) *
#       5) Start< 14.5 33 6 absent (0.81818182 0.18181818)
#        10) Age< 55 12 0 absent (1.00000000 0.00000000) *
#        11) Age>=55 21 6 absent (0.71428571 0.28571429)
#          22) Age>=98 16 2 absent (0.87500000 0.12500000) *
#          23) Age< 98 5 1 present (0.20000000 0.80000000) *
#     3) Start< 8.5 19 8 present (0.42105263 0.57894737)
#       6) Age< 11.5 2 0 absent (1.00000000 0.00000000) *
#       7) Age>=11.5 17 6 present (0.35294118 0.64705882)
#        14) Start< 5.5 12 6 absent (0.50000000 0.50000000)
#          28) Age>=130.5 2 0 absent (1.00000000 0.00000000) *
#          29) Age< 130.5 10 4 present (0.40000000 0.60000000)
#            58) Age< 93 6 2 absent (0.66666667 0.33333333)
#             116) Number< 4.5 3 0 absent (1.00000000 0.00000000) *
#             117) Number>=4.5 3 1 present (0.33333333 0.66666667) *
#            59) Age>=93 4 0 present (0.00000000 1.00000000) *
#        15) Start>=5.5 5 0 present (0.00000000 1.00000000) *
If a node is split, its two children are numbered node * 2 + 0:1, so the children of node 5 are 5 * 2 + 0:1, that is, 10 and 11. Also note that under this pattern a left child is always even and a right child is always odd (for example, node 2 has children 4 and 5, and node 14 has children 28 and 29).
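The numbering rule can be checked with a few lines of base R. This is only a sketch: children_of, parent_of and is_left_child are hypothetical helper names, not functions provided by rpart, and they rely on nothing but the arithmetic of the node numbers.

```r
# Hypothetical helpers built only on the node-numbering rule

# Children of a node: 2 * node and 2 * node + 1
children_of <- function(node) 2 * node + 0:1

# Parent of a node: integer division by 2 works for both children,
# because it discards the "+ 1" carried by the right child
parent_of <- function(node) node %/% 2

# Left children are the even-numbered nodes (the root, 1, is neither)
is_left_child <- function(node) node > 1 & node %% 2 == 0

children_of(5)           # 10 11
parent_of(c(10, 11))     # 5 5
is_left_child(c(10, 11)) # TRUE FALSE
```

Note that these helpers say nothing about whether a node actually exists in a given fit; that still has to be checked against the row names of fit$frame.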
Therefore, given any node number, we can work back to find the parents:
## path of node numbers from the root down to node x;
## the parent of any node x > 1 is x %/% 2
parent <- function(x) {
  if (x > 1) c(parent(x %/% 2), x) else x
}
parent(23)
# [1]  1  2  5 11 23
## children of the same node should have the same path
identical(head(parent(28), -1), head(parent(29), -1))
# [1] TRUE
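To connect this back to the question, the same idea can be applied to every leaf of a fitted tree and lined up with the splits that path.rpart reports. The following is a sketch under assumptions: it uses the default kyphosis fit from the question, and ancestors is a hypothetical, non-recursive stand-in for the function above rather than anything exported by rpart.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Hypothetical helper: node numbers from the root down to `node`,
# derived from the numbering rule alone
ancestors <- function(node) {
  path <- node
  while (node > 1) {
    node <- node %/% 2
    path <- c(node, path)
  }
  path
}

# Node numbers are the row names of fit$frame; leaves are marked "<leaf>"
node_ids <- as.numeric(rownames(fit$frame))
leaves   <- node_ids[fit$frame$var == "<leaf>"]

# One path of node numbers per leaf, named by the leaf's node number
paths <- setNames(lapply(leaves, ancestors), leaves)

# Pair each split reported by path.rpart with the node it leads to
leaf   <- leaves[1]
splits <- path.rpart(fit, leaf, print.it = FALSE)[[1]]
data.frame(node = ancestors(leaf), split = splits)
```

Because path.rpart lists one entry per level starting with "root", its output has the same length as the ancestor path, so the two can be bound column-wise. If the goal is then to collapse the subtree below a parent, snip.rpart(fit, toss = ...) may be a reasonable starting point, though whether it fits the pruning criterion described in the question is not something this sketch verifies.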
Upvotes: 2