Reputation: 89
I am training a decision tree model based on the heart disease data from Kaggle.
Since I am also building other models using 10-fold CV, I am trying to use the caret package with the rpart method to build the tree. However, the plot result is weird, as "thalium" should be a factor. Why does it show "thaliumnormal < 0.5"? Does this mean that if "thalium" == "normal" then take the left route "yes", otherwise the right route "no"?
Many thanks!
Edits: I apologize for not providing enough background info, which seems to have caused some confusion. "thalium" is a variable that represents a technique used to detect coronary stenosis (aka narrowing). It's a factor with three levels (normal, fixed defect, reversible defect).
In addition, I would like to make the graph more readable, e.g. instead of "thaliumnormal < 0.5", it should be something like "thalium = normal". I could achieve this by using rpart directly (see below).
However, you have probably noticed that the tree is different, even though I used the cp value recommended by caret's rpart with 10-fold CV (see the code below).
I understand that these two packages may result in some differences. Ideally, I could use caret with method rpart to build the tree so that it aligns with other models built in caret. Does anyone know how I could make the plot label for the tree model built with caret rpart easier to understand?
Upvotes: 7
Views: 1933
Reputation: 4993
It would help if there were some data, like dput(head(data)) to show us what your data really looks like, or str(data) to show the levels of variables and data types.
But likely (without having seen it) the variable is thalium and one level is normal, and the tree has selected a LEVEL of the variable thalium and is evaluating whether an observation is at that level, normal, or not.
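When the model is trained through caret's formula interface, each level of a categorical variable is turned into a 0/1 dummy column, and the tree then makes its decision on that column being >= .5 or < .5: 0 is always below the cut-off and 1 is always above it.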
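To see where a column name like thaliumnormal can come from, here is a minimal sketch using base R's model.matrix on made-up data. The data frame toy and its values are hypothetical, and it is only an assumption that caret's formula interface builds its dummy columns in a similar way.

```r
# Hypothetical toy data: the level names follow the question, the rows are invented.
toy <- data.frame(
  thalium = factor(
    c("normal", "fixed defect", "reversible defect", "normal"),
    levels = c("fixed defect", "normal", "reversible defect")
  )
)

# Expanding a factor through a formula yields one 0/1 indicator column
# per non-reference level, named <variable><level>.
dummies <- model.matrix(~ thalium, data = toy)

print(colnames(dummies))
print(dummies[, "thaliumnormal"])
```

The column thaliumnormal is 1 for rows where thalium is normal and 0 otherwise, so a tree that only sees this column can do nothing but split between 0 and 1.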
By design, most tree algorithms choose, for each variable (including a 0/1 dummy), the cut-off that creates the most purity, i.e. the one that best separates the classes on either side, and they place that cut-off midway between the two adjacent values that give the greatest separation.
With a binary variable, that split is at .5 because it is midway between the only two values the dummy can take, 0 and 1.
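The midpoint rule can be checked on a small example. This is a sketch on invented data, fitted with rpart directly rather than through caret; the data frame toy, the outcome column disease and the control settings are assumptions made only for illustration.

```r
library(rpart)

# Hypothetical data: a 0/1 dummy that separates the two classes perfectly.
toy <- data.frame(
  thaliumnormal = rep(c(0, 1), each = 10),
  disease = factor(rep(c("yes", "no"), each = 10))
)

# Small minsplit and cp = 0 so that the tiny data set is allowed to split.
fit <- rpart(disease ~ thaliumnormal, data = toy, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

# The "index" column of the splits matrix holds the cut-off of a numeric split.
split_point <- fit$splits[1, "index"]
print(split_point)
```

The cut-off lands halfway between the two observed values of the dummy, which is where the 0.5 in the plot label comes from.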
Upvotes: 5
Reputation: 2253
Your variable thaliumnormal is either 0 or 1, which represent no or yes - correct?
In that case, rpart takes the midvalue 0.5 so that every value, 0 or 1, is either below or above 0.5.
In this plot, observations that satisfy the displayed condition, here values below the cut-off 0.5, go left. So thaliumnormal==0 goes left, the "yes" branch.
You can see the same thing for sex.
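As a rough way to compare the two kinds of label, the sketch below fits rpart once on a hand-made 0/1 dummy and once on the factor itself, then prints the node labels. It does not use caret; the data, the object names and the control settings are hypothetical, and the labels for the real heart disease data will differ.

```r
library(rpart)

# Hypothetical toy data: level names follow the question, outcomes are invented.
toy <- data.frame(
  thalium = factor(rep(c("normal", "fixed defect", "reversible defect"),
                       times = c(10, 5, 5))),
  disease = factor(rep(c("no", "yes", "yes"), times = c(10, 5, 5)))
)
# A hand-made 0/1 indicator for the "normal" level.
toy$thaliumnormal <- as.numeric(toy$thalium == "normal")

ctrl <- rpart.control(minsplit = 2, cp = 0)
fit_dummy  <- rpart(disease ~ thaliumnormal, data = toy,
                    method = "class", control = ctrl)
fit_factor <- rpart(disease ~ thalium, data = toy,
                    method = "class", control = ctrl)

# labels() returns one label per node; with a single split, entry 2 is the
# condition for the left child and entry 3 the condition for the right child.
dummy_labels  <- labels(fit_dummy)
factor_labels <- labels(fit_factor, minlength = 0)  # 0 = do not abbreviate levels

print(dummy_labels)
print(factor_labels)
```

The dummy fit is labelled with a numeric comparison against 0.5, while the factor fit is labelled with level names, which is the more readable form the question asks for.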
Upvotes: 0