goldisfine

Reputation: 4850

Combining DF and rpart$where?

If I run DF$where <- tree$where after fitting an rpart object with DF as the data, will the where column map each row to its corresponding leaf?

Thanks!

Upvotes: 1

Views: 866

Answers (1)

IRTFM

Reputation: 263301

Assuming I have understood the question correctly, here is one way to check that this is true, using the first example in ?rpart:

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
kyphosis$where <- fit$where

> str(kyphosis)
'data.frame':   81 obs. of  5 variables:
 $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
 $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
 $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
 $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...
 $ where   : int  9 7 9 9 3 3 3 3 3 8 ...
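
The direct assignment works here because all 81 rows of kyphosis are used in the fit. As a hedged sketch of the more general case: with rpart's default na.action, rows with a missing response are dropped, so fit$where can be shorter than the data and a plain DF$where <- fit$where fails. The copy df, the two responses set to NA, and the name fit_na are made up for illustration, and the sketch assumes fit$where carries the original row names as its names.

```r
library(rpart)

# Hypothetical copy of the data with two missing responses (illustration only)
df <- kyphosis
df$Kyphosis[c(2, 5)] <- NA

fit_na <- rpart(Kyphosis ~ Age + Number + Start, data = df)

# The default na.action drops the two rows, so the lengths no longer agree
length(fit_na$where)
nrow(df)

# Align by row name instead of by position (assumes names(fit_na$where)
# are the row names of the rows that were kept)
idx <- match(names(fit_na$where), rownames(df))
df$where <- NA_integer_
df$where[idx] <- fit_na$where
```

Rows that were dropped from the fit keep NA in the where column instead of silently shifting the remaining values.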

> plot(fit)
> text(fit, use.n = TRUE)

[Image: plot of the fitted tree, labeled with per-node counts]

And now look at some tables based on the 'where' vector and some logical tests:

First split (Start >= 8.5):

> with(kyphosis, table(where, Start >= 8.5)) 


where FALSE TRUE
    3     0   29
    5     0   12
    7     0   14
    8     0    7
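    9    19    0  # the 19 cases with Start < 8.5 all have where == 9

The value 9 indexes the 9th row of fit$frame, which is a leaf whose node number (its row name) is 3:

> fit$frame[9,]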
     var  n wt dev yval complexity ncompete nsurrogate   yval2.V1
3 <leaf> 19 19   8    2       0.01        0          0  2.0000000
    yval2.V2   yval2.V3   yval2.V4   yval2.V5 yval2.nodeprob
3  8.0000000 11.0000000  0.4210526  0.5789474      0.2345679

Second split (Start >= 14.5):

> with(kyphosis, table(where, Start >= 8.5, Start>=14.5))
, ,  = FALSE


where FALSE TRUE
    3     0    0
    5     0   12
    7     0   14
    8     0    7
    9    19    0

, ,  = TRUE


where FALSE TRUE
    3     0   29
    5     0    0
    7     0    0
    8     0    0
    9     0    0

The 29 cases with Start >= 14.5 all have where == 3, and row 3 of fit$frame is the leaf they land in (node number 4):

> fit$frame[3,]
     var  n wt dev yval complexity ncompete nsurrogate   yval2.V1
4 <leaf> 29 29   0    1       0.01        0          0  1.0000000
    yval2.V2   yval2.V3   yval2.V4   yval2.V5 yval2.nodeprob
4 29.0000000  0.0000000  1.0000000  0.0000000      0.3580247

So fit$where gives, for each row of the data, the row index in fit$frame of the terminal node (labeled '<leaf>') that the observation falls into. It is not the node number, which appears in the row names of fit$frame, so it may or may not be the labeling you had in mind.

> fit$frame
      var  n wt dev yval complexity ncompete nsurrogate    yval2.V1
1   Start 81 81  17    1 0.17647059        2          1  1.00000000
2   Start 62 62   6    1 0.01960784        2          2  1.00000000
4  <leaf> 29 29   0    1 0.01000000        0          0  1.00000000
5     Age 33 33   6    1 0.01960784        2          2  1.00000000
10 <leaf> 12 12   0    1 0.01000000        0          0  1.00000000
11    Age 21 21   6    1 0.01960784        2          0  1.00000000
22 <leaf> 14 14   2    1 0.01000000        0          0  1.00000000
23 <leaf>  7  7   3    2 0.01000000        0          0  2.00000000
3  <leaf> 19 19   8    2 0.01000000        0          0  2.00000000
      yval2.V2    yval2.V3    yval2.V4    yval2.V5 yval2.nodeprob
1  64.00000000 17.00000000  0.79012346  0.20987654     1.00000000
2  56.00000000  6.00000000  0.90322581  0.09677419     0.76543210
4  29.00000000  0.00000000  1.00000000  0.00000000     0.35802469
5  27.00000000  6.00000000  0.81818182  0.18181818     0.40740741
10 12.00000000  0.00000000  1.00000000  0.00000000     0.14814815
11 15.00000000  6.00000000  0.71428571  0.28571429     0.25925926
22 12.00000000  2.00000000  0.85714286  0.14285714     0.17283951
23  3.00000000  4.00000000  0.42857143  0.57142857     0.08641975
3   8.00000000 11.00000000  0.42105263  0.57894737     0.23456790
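
The checks above can be collected into a short sketch. It assumes the same kyphosis fit with rpart's default settings; the variable names node and idx are illustrative only.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Every value of fit$where should point at a '<leaf>' row of fit$frame
all(fit$frame$var[fit$where] == "<leaf>")

# The node numbers used in the printed tree are the row names of fit$frame
node <- as.integer(rownames(fit$frame)[fit$where])
table(node)

# The number of observations per leaf should agree with the n column
idx <- sort(unique(fit$where))
all(as.vector(table(fit$where)) == fit$frame$n[idx])
```

If you want the node numbers in the data frame instead of the row indices, store node in place of fit$where.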

Upvotes: 1