Rui Tongyu

Reputation: 89

caret rpart decision tree plotting result

I am training a decision tree model based on the heart disease data from Kaggle.

Since I am also building other models using 10-fold CV, I am trying to use the caret package with the rpart method to build the tree. However, the plot looks odd, since "thalium" is a factor. Why does it show "thaliumnormal < 0.5"? Does this mean that if thalium == "normal" the tree takes the left route ("yes"), and otherwise the right route ("no")?

Many thanks!

[Image: caret rpart decision tree plotted with fancyRpartPlot]

Edit: I apologize for not providing enough background information, which seems to have caused some confusion. "thalium" is a variable that represents a technique used to detect coronary stenosis (i.e. narrowing). It is a factor with three levels (normal, fixed defect, reversible defect).

[Image: data structure]

In addition, I would like to make the graph more readable: instead of "thaliumnormal < 0.5", it should say something like "thalium = normal". I could achieve this by using rpart directly (see below).

[Image: rpart decision tree plot]
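For concreteness, here is a minimal sketch of that "rpart directly" approach; the data frame name (heart) and the outcome column (heart_disease) are assumed placeholders, not names from the original post.

    library(rpart)
    library(rattle)   # provides fancyRpartPlot()

    # Sketch only: "heart" and "heart_disease" are placeholder names.
    fit_rpart <- rpart(heart_disease ~ ., data = heart, method = "class")

    # rpart handles factors natively, so a split on thalium is labelled with
    # its levels (e.g. "thalium = normal") instead of as a 0/1 dummy.
    fancyRpartPlot(fit_rpart)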

However, as you have probably noticed, the tree is different, even though I used the recommended cp value from the caret rpart 10-fold CV (see the code below).

[Image: code for the recommended cp, used for the rpart tree plotted with fancyRpartPlot]
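Roughly, that workflow could look like the sketch below: pull the cross-validated cp out of caret and reuse it in the rpart call. The object and column names are the same placeholders as before.

    library(caret)
    library(rpart)
    library(rattle)

    set.seed(123)   # illustrative seed

    # 10-fold CV with caret's rpart method; the formula interface dummy-codes factors.
    fit_caret <- train(
      heart_disease ~ ., data = heart,
      method     = "rpart",
      trControl  = trainControl(method = "cv", number = 10),
      tuneLength = 10
    )

    best_cp <- fit_caret$bestTune$cp   # the "recommended" cp from 10-fold CV

    # Reuse that cp in a plain rpart fit so the factor labels survive.
    fit_rpart_cv <- rpart(heart_disease ~ ., data = heart, method = "class",
                          control = rpart.control(cp = best_cp))
    fancyRpartPlot(fit_rpart_cv)

One plausible reason the two trees differ is that caret's formula interface expands factors into 0/1 dummy columns before fitting (hence "thaliumnormal"), while a plain rpart call keeps thalium as a single factor, so the two fits are not working from identical design matrices.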

I understand that the two packages may produce somewhat different results. Ideally, I would use caret with method = "rpart" to build the tree so that it aligns with the other models built in caret. Does anyone know how I could make the plot labels easier to understand for the tree model built with caret rpart?

Upvotes: 7

Views: 1933

Answers (2)

sconfluentus

Reputation: 4993

It would help if there were some data, e.g. dput(head(data)) to show us what your data really look like, or str(data) to show the variable types and factor levels.
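For example (with data standing in for your actual data frame name):

    dput(head(data))  # a copy-pasteable snapshot of the first few rows
    str(data)         # each column's type and, for factors, its levels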

But most likely (without having seen it) the variable is thallium, one of its levels is normal, and the tree has selected that LEVEL of the variable thallium and is evaluating whether an observation is at that level (normal) or not.

The tree treats categorical variables as dummies by level and makes its decision based on whether the dummy is >= 0.5 or < 0.5; 0 is always below the cut-off and 1 is always above it.

By design, most tree algorithms choose, for each variable (including a 0/1 dummy), the cut-off that creates the most purity (moves the most observations to one side or the other and closer to a classification), and they place that cut-off midway between the two values that create the greatest separation between the groups.

With a binary variable, that split is at 0.5 because it is midway between the two values the dummy can take, 0 and 1.
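A toy illustration of that dummy-coding, using the spelling and levels from the question (the data values themselves are made up):

    # "thalium" spelled as in the questioner's data; levels from the question's edit.
    thalium <- factor(c("normal", "fixed defect", "reversible defect", "normal"))

    # Formula interfaces such as caret's expand a factor into one 0/1 indicator
    # column per non-reference level, producing names like "thaliumnormal".
    model.matrix(~ thalium)

    # A 0/1 indicator has only one possible split point, so the tree reports
    # the midpoint: thaliumnormal < 0.5 (not normal) vs. >= 0.5 (normal).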

Upvotes: 5

cmirian

Reputation: 2253

Your dummy thaliumnormal is either 0 or 1, representing yes or no - correct?

In that case, rpart takes the midpoint, 0.5, so that every decision on 0 or 1 falls either above or below 0.5.

Values below the cut-off, in this case 0.5, always turn left. So thaliumnormal == 0 turns left, yes.

You can see the same thing in the split for sex.
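One way to check the split directions on the caret fit itself (fit_caret is an assumed name for the train() object):

    library(rpart.plot)

    # Printing the underlying rpart object lists each node's condition,
    # e.g. "thaliumnormal< 0.5", together with the cases it sends each way.
    print(fit_caret$finalModel)

    # rpart.plot draws the same tree; by default the branch for which the
    # displayed condition is true goes to the left.
    rpart.plot(fit_caret$finalModel)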

Upvotes: 0
