Donbeo

Reputation: 17617

R decision tree using all the variables

I would like to perform a decision tree analysis. I want that the decision tree uses all the variables in the model.

I also need to plot the decision tree. How can I do that in R?

This is a sample of my dataset

> head(d)
  TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1               2               2                4        2       0       0     0
2               2               2                4        3       1       0     0
3               2               2                5        1       0       0     0
4               2               2                4        2       1       0     0
5               2               3                3        1       0       0     0
6               2               3                3        2       0       0     0

I would like to use the formula

myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score

Note that all the variables are categorical.

EDIT: My problem is that some variables do not appear in the final decision tree. The depth of the tree should be controlled by a penalty parameter alpha. I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.

Upvotes: 1

Views: 17748

Answers (3)

David Arenburg

Reputation: 92282

As mentioned above, if you want to run the tree on all the variables, you should write it as

ctree(wheeze3 ~ ., d)

The penalty you mentioned is set via ctree_control(). There you can set the p-value threshold as well as the minimum split and bucket sizes. So in order to maximize the chance that all the variables will be included, you should do something like this:

ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))

The problem is that you run the risk of overfitting.

The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they don't have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether each is significant, the decision tree does not return the insignificant variables, i.e., it doesn't split by them.

For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees

Upvotes: 6

ctbrown

Reputation: 2361

The easiest way is to use the rpart package, which is distributed with base R as a recommended package.

library(rpart) 
model <- rpart( wheeze3 ~ ., data=d ) 

summary(model)
plot(model)
text(model)

The . in the formula means "use all the other variables as independent variables".
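Tying this back to the question's penalty parameter: in rpart, the complexity parameter cp plays the role of alpha, and setting it to 0 grows the largest tree, i.e., the one that minimizes training error (at the risk of overfitting). A sketch, using a toy data frame that only mimics the column names from the question:

```r
# Sketch: cp in rpart.control() is the cost-complexity penalty (the
# question's alpha); cp = 0 grows the full tree that minimizes
# training error, at the risk of overfitting.
library(rpart)

set.seed(1)
d <- data.frame(
  TargetGroup2000  = factor(sample(1:3, 200, replace = TRUE)),
  TargetGroup2012  = factor(sample(1:3, 200, replace = TRUE)),
  SmokingGroup_Kai = factor(sample(1:5, 200, replace = TRUE)),
  PA_Score         = factor(sample(1:3, 200, replace = TRUE)),
  wheeze3          = factor(sample(0:1, 200, replace = TRUE))
)

model <- rpart(wheeze3 ~ ., data = d, method = "class",
               control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# plot.rpart() errors on a root-only tree, so guard the plotting calls.
if (nrow(model$frame) > 1) {
  plot(model)
  text(model)
}
printcp(model)  # cp table, useful for pruning the overgrown tree later
```

After inspecting the cp table, prune(model, cp = ...) recovers a smaller tree if the full one overfits.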

Upvotes: 2

          plot(ctree(myFormula, data = d))

Upvotes: 0
