Reputation: 1587
I'm struggling to understand the output of a classification tree in rpart. I don't understand how 'Root node error' (one of the outputs of the printcp function) is calculated. I couldn't find its definition in the rpart package documentation either.
As an example, I loaded the Titanic data:
library(titanic)
library(rpart)
tt<-titanic_train
table(tt$Survived)
So we have 549 people who died (Survived = 0) and 342 people who survived (Survived = 1), 891 people in total.
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = tt)
printcp(fit)
This gives the following result:
Regression tree:
rpart(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
    Fare + Embarked, data = tt)

Variables actually used in tree construction:
[1] Age    Fare   Pclass Sex    SibSp

Root node error: 210.73/891 = 0.23651

n= 891

        CP nsplit rel error  xerror     xstd
1 0.295231      0   1.00000 1.00538 0.016124
2 0.073942      1   0.70477 0.70896 0.033228
3 0.027124      2   0.63083 0.63570 0.031752
4 0.026299      3   0.60370 0.62105 0.032815
5 0.023849      4   0.57740 0.61154 0.032884
6 0.021091      5   0.55356 0.58294 0.032127
7 0.010000      6   0.53246 0.57097 0.032402
Here 'Root node error' means the misclassification error at the root, before any splits are added, am I right? So if I predict the majority class for everyone, I will be wrong in 342 cases out of 891, and the root node error should be 342/891. But the output shows 210.73/891.
I would be grateful for help understanding what 210.73 means in 'Root node error' and how it was calculated for this Titanic data. I have been searching all day and can't find any explanation.
Thank you in advance for your help.
Upvotes: 3
Views: 7920
Reputation: 137
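For a classification tree, root node error is the proportion of incorrectly classified records at the first (root) node, before any split. In the output above, however, rpart fitted a regression tree (the header says "Regression tree") because Survived is numeric, so the root node error is the sum of squared deviations of Survived from its mean (210.73) divided by n.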
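The 210.73 in the question can be reproduced by hand. The sketch below assumes only the class counts from the table() output in the question (549 and 342) and uses base R. The variable names are illustrative and are not part of rpart.

```r
# Class counts taken from the table() output in the question:
# 549 passengers with Survived = 0 and 342 with Survived = 1.
y <- c(rep(0, 549), rep(1, 342))
n <- length(y)

# Root deviance of a regression (anova) tree: sum of squared
# deviations of the response from its overall mean.
root_sse <- sum((y - mean(y))^2)
root_error <- root_sse / n

# What the question expected instead: the misclassification rate
# when every record is assigned to the majority class.
majority_error <- as.numeric(min(table(y))) / n

print(round(root_sse, 2))        # 210.73
print(round(root_error, 5))      # 0.23651
print(round(majority_error, 5))  # 0.38384
```

For a 0/1 response the sum of squares reduces to 342 * 549 / 891, which is where the non-integer numerator comes from.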
For more information see Understanding the Outputs of the Decision Tree Tool.
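To get the 342/891 the question expected, the tree has to be fitted as a classification tree, for example by passing method = "class" (or by converting Survived to a factor). The sketch below is a minimal illustration on stand-in data built from the class counts in the question. The predictor x is hypothetical and exists only so that rpart has something to split on. It also assumes that the ratio printed by printcp is the root row's dev divided by n in the fitted object's frame.

```r
library(rpart)

# Stand-in data: same class counts as in the question.
# `x` is a hypothetical predictor, not a column of the Titanic data.
d <- data.frame(Survived = c(rep(0, 549), rep(1, 342)))
d$x <- seq_len(nrow(d))

# Numeric response with no method given: rpart fits a regression tree.
fit_anova <- rpart(Survived ~ x, data = d)

# Same formula, but explicitly requested as a classification tree.
fit_class <- rpart(Survived ~ x, data = d, method = "class")

# Row 1 of the frame is the root node.
anova_root <- fit_anova$frame$dev[1] / fit_anova$frame$n[1]
class_root <- fit_class$frame$dev[1] / fit_class$frame$n[1]

print(fit_anova$method)
print(anova_root)
print(fit_class$method)
print(class_root)
```

With the real data, the equivalent change would be adding method = "class" to the rpart call from the question.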
Upvotes: 1