Rojin

Reputation: 365

Decision Tree in R using rpart based on multiple splitting attributes

I am trying to build a decision tree for a prediction model on the following dataset:

[Image: sample of the dataset]

And here is my code:

library(rpart)   # provides rpart()
library(rattle)  # provides fancyRpartPlot()
fitTree = rpart(classLabel ~ from_station_id + start_day + start_time
            + gender + age, method = "class", data = d)
fancyRpartPlot(fitTree)

But the resulting decision tree uses only one of the attributes (from_station_id) as the splitting attribute and ignores the values of the others (start_day, start_time, gender, age). Here is the result:

[Image: plot of the resulting decision tree]

What am I doing wrong?

Upvotes: 1

Views: 2624

Answers (1)

RTB

Reputation: 86

Your syntax looks correct. Based on the snippet of your dataset, classLabel and from_station_id appear to be closely correlated (and perhaps gender, too). If so, from_station_id is the best predictor of classLabel, and the other variables are either uninformative or also correlated but masked by it, so they do not show up in the tree. Try:

summary(fitTree)

This shows in more detail how each split was made, along with the variable importance, which can help you evaluate masking. Avoid correlated predictors where you can, as they cause masking and can obscure interactions.
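As a rough sketch of how to read that output, the example below fits a tree on the kyphosis data that ships with rpart (a stand-in for your data frame d, which I do not have) and compares the variables actually used for splits against the importance scores. A variable that scores in the importance vector but never appears as a split is a candidate for masking.

```r
library(rpart)

# kyphosis is bundled with rpart; it stands in for the asker's data frame `d`.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Importance scores also credit surrogate splits, so a predictor can
# score here without ever appearing as a split in the plotted tree.
print(fit$variable.importance)

# Variables actually used as primary splits ("<leaf>" marks terminal nodes).
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
print(used)

# summary() lists the primary, competing and surrogate splits for each node.
summary(fit)
```

On your own model, replace fit with fitTree and look at the competing and surrogate splits listed under the root node.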

If from_station_id is the only variable that appears in the summary, then the tree really is ignoring the other variables, but I am not sure why it would.
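One thing that may be worth checking, though I cannot confirm it is the cause here, is the stopping rule. By default rpart only keeps a split if it improves the fit by at least cp = 0.01, and it will not attempt to split nodes with fewer than minsplit = 20 observations. The sketch below, again on the bundled kyphosis data rather than your d, relaxes both settings through rpart.control() and compares the tree sizes; the specific values cp = 0 and minsplit = 5 are arbitrary choices for illustration.

```r
library(rpart)

# Default stopping rules: cp = 0.01, minsplit = 20.
default_fit <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis, method = "class")

# Relaxed stopping rules (illustrative values, not a recommendation).
loose_fit <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis, method = "class",
                   control = rpart.control(cp = 0, minsplit = 5))

# Number of nodes (rows of the frame) in each tree.
n_default <- nrow(default_fit$frame)
n_loose <- nrow(loose_fit$frame)
print(c(default = n_default, loose = n_loose))

# The cp table shows how cross-validated error changes with tree size.
printcp(loose_fit)
```

If other variables only appear once cp is lowered, they are contributing very little beyond from_station_id, and the larger tree is likely to need pruning back with prune().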

Upvotes: 1
