Reputation: 53
We are modelling a decision tree using both continuous and binary inputs, analyzing weather effects on biking behavior. A linear regression suggests that "rain" has a large impact on bike counts. Our rain variable is binary, indicating the hourly rain status.
A decision tree built with rpart does not include "rain" as a node, although we expect it to be highly decisive for the number of bikes. This might be due to how the rain variable is coded. rpart seems to prefer continuous variables (like temperature) for decision nodes.
Is there anything we should know about how rpart decides whether to use continuous or binary variables as decision nodes? Is it possible to control this variable selection?
library("rpart")
fit <- rpart(bikecount ~ df.weather$temp+df.weather$weekday+df.weather$rain, data=training.data, method="class")
Upvotes: 0
Views: 1436
Reputation: 626
Function rpart implements the CART algorithm of Breiman, Friedman, Olshen and Stone (1984), which is known to suffer from biased variable selection: given two or more variables that are equally predictive of the outcome, the variable with the largest number of unique values is most likely to be selected for splitting. See, for example, Loh and Shih (1997) and Hothorn, Hornik & Zeileis (2006).
Unbiased recursive partitioning methods separate the selection of 1) the splitting variable and 2) the splitting value, which solves this variable selection bias. Unbiased recursive partitioning has been implemented in the R package partykit.
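This bias is easy to reproduce. A minimal simulation sketch (not from the original post, variable names are made up): with a pure-noise response, rpart's exhaustive search still tends to split first on the predictor offering the most candidate split points.

```r
library(rpart)

# Response is pure noise, so neither predictor is truly informative,
# yet the continuous predictor (~99 candidate split points) is chosen
# far more often than the binary one (1 candidate split point).
set.seed(1)
first_split <- replicate(100, {
  df <- data.frame(
    y      = rnorm(100),           # pure-noise response
    binary = rbinom(100, 1, 0.5),  # 1 possible split
    contin = runif(100)            # many possible splits
  )
  fit <- rpart(y ~ binary + contin, data = df,
               control = rpart.control(cp = 0, minsplit = 20, maxdepth = 1))
  as.character(fit$frame$var[1])   # variable used in the root split
})
table(first_split)
```

With cp = 0 the tree is forced to make its single greedy split; tabulating the root-split variable over replications shows the preference for the variable with more unique values.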
If the code you provide above works for function rpart (it is unclear to me why the predictor variables in the formula include $ while the response variable does not, even though the data argument has been specified), you should be able to fit an unbiased classification tree as follows (referencing the predictors by column name only):
library("partykit")
ct <- ctree(bikecount ~ temp + weekday + rain, data = training.data)
plot(ct)
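As to your second question, whether the selection can be controlled within rpart itself: rpart() has a cost argument, a vector of per-variable scalings by which each candidate split's improvement is divided, so a cost below 1 makes that variable relatively more attractive. A hedged sketch, reusing the question's training.data and column names (not defined here):

```r
library(rpart)

# Sketch only: `training.data` and its columns come from the question
# and are not defined in this snippet. `cost` has one value per
# predictor, in formula order; improvements are divided by the cost,
# so a value below 1 favours that variable.
fit <- rpart(bikecount ~ temp + weekday + rain,
             data = training.data,
             cost = c(1, 1, 0.5))  # make "rain" more likely to be chosen
```

Note that this only rescales the greedy search; it does not remove the underlying selection bias the way ctree does.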
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth, Monterey, CA.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.
Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815-840.
Upvotes: 1