Reputation: 365
I am trying to build a decision tree for a prediction model on the following dataset:
And here is my code:
library(rpart)
library(rattle)   # provides fancyRpartPlot()
fitTree = rpart(classLabel ~ from_station_id + start_day + start_time
                + gender + age, method = "class", data = d)
fancyRpartPlot(fitTree)
But the resulting decision tree used only one of the attributes (from_station_id) as the splitting attribute and ignored the values of the other attributes (start_day, start_time, gender, age). Here is the result:
What am I doing wrong?
Upvotes: 1
Views: 2624
Reputation: 86
Your syntax looks correct. Based on the snippet of your dataset, classLabel and from_station_id may be closely correlated (and maybe gender, too). In that case, from_station_id is the best predictor of classLabel, and the other variables are either uninformative or also correlated with it and being masked, so they will not show up in the tree. Try:
summary(fitTree)
This will show you in detail how the splits were made and report the variable importance, which helps you evaluate masking. You should avoid correlated predictors where possible, as they mask one another and can interfere with interactions.
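As a quick sketch (reusing the d and fitTree names from your question), you can also pull the importance scores and the complexity table out of the fitted object directly, and eyeball how strongly classLabel and from_station_id are associated:

summary(fitTree)                        # split details plus variable importance
fitTree$variable.importance             # importance scores as a named numeric vector
printcp(fitTree)                        # complexity table: which variables were actually used
table(d$classLabel, d$from_station_id)  # rough check of the association between the two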
If you only see from_station_id in the summary, then you know the tree is ignoring the other variables, but I am not sure why it would.
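Two further probes you could try, sketched below with the variable names from your question (fitNoStation and fitLoose are just illustrative names): refit without from_station_id to see whether the other predictors carry any signal on their own, and loosen rpart's stopping rules so weaker splits are not discarded by the default complexity parameter.

# 1. Drop from_station_id: if the remaining predictors still give a stump,
#    they likely carry little information about classLabel on their own.
fitNoStation = rpart(classLabel ~ start_day + start_time + gender + age,
                     method = "class", data = d)
fancyRpartPlot(fitNoStation)

# 2. Loosen the stopping rules: a lower cp and smaller minsplit keep splits
#    that the defaults (cp = 0.01, minsplit = 20) would prune away.
fitLoose = rpart(classLabel ~ from_station_id + start_day + start_time
                 + gender + age, method = "class", data = d,
                 control = rpart.control(cp = 0.001, minsplit = 10))
fancyRpartPlot(fitLoose)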
Upvotes: 1