Bukowski

Reputation: 53

Decision Tree in R with binary and continuous input

We are modelling a decision tree using both continuous and binary inputs, analyzing the effect of weather on biking behavior. A linear regression suggests that "rain" has a large impact on bike counts. Our rain variable is binary, indicating the hourly rain status.

A decision tree created with rpart does not include "rain" as a node, although we expect it to be highly decisive for the number of bikes. This might be due to how the rain variable is coded: rpart seems to prefer continuous variables (like temperature) for decision nodes.

Is there anything we should know about how rpart decides whether to use continuous or binary variables as decision nodes? Is it possible to control this variable selection?

library("rpart")
fit <- rpart(bikecount ~ df.weather$temp + df.weather$weekday + df.weather$rain,
             data = training.data, method = "class")

Upvotes: 0

Views: 1436

Answers (1)

Marjolein Fokkema

Reputation: 626

Function rpart implements the CART algorithm of Breiman, Friedman, Olshen and Stone (1984), which is known to suffer from biased variable selection: given two or more variables that are equally predictive of the outcome, the variable with the largest number of unique values is the most likely to be selected for splitting. See, for example, Loh and Shih (1997) and Hothorn, Hornik and Zeileis (2006).
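You can see this bias with a minimal simulation sketch (hypothetical data, not from your question): neither predictor is related to the outcome, yet CART tends to pick the continuous one for the root split, simply because it offers many more candidate cutpoints:

```r
library("rpart")

set.seed(42)
n <- 500
df <- data.frame(
  y      = rnorm(n),                      # outcome unrelated to either predictor
  x_bin  = factor(rbinom(n, 1, 0.5)),     # binary noise: only one possible split
  x_cont = rnorm(n)                       # continuous noise: ~n-1 candidate splits
)

# Force a single split so the tree is not pruned away entirely
fit <- rpart(y ~ x_bin + x_cont, data = df,
             control = rpart.control(cp = 0, maxdepth = 1))

# Variable chosen at the root; with equally uninformative predictors,
# this will usually be x_cont
fit$frame$var[1]
```

Because the data are pure noise, the chosen variable is not guaranteed on every seed, but over repeated simulations the continuous predictor dominates.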

Unbiased recursive partitioning methods separate the selection of (1) the splitting variable and (2) the splitting value, which removes this variable selection bias. Unbiased recursive partitioning is implemented in the R package partykit.

If the code you provide above works for function rpart (it is unclear to me why the predictor variables in the formula include $ while the response variable does not, even though the data argument has been specified), you should be able to fit an unbiased classification tree as follows:

library("partykit")
ct <- ctree(bikecount ~ df.weather$temp + df.weather$weekday + df.weather$rain, 
            data=training.data)
plot(ct)
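As an aside, if temp, weekday, and rain are columns of training.data (an assumption on my part, since your formula uses $ on a different data frame), the more conventional formula would reference the column names directly:

```r
library("partykit")

# Standard formula interface: column names resolved against `data`
ct <- ctree(bikecount ~ temp + weekday + rain, data = training.data)
plot(ct)
```

Mixing df.weather$... in the formula with data = training.data can silently fit the model on the wrong observations, so it is worth checking which data frame actually holds your training rows.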

References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth, Monterey, CA.

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651-674.

Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815-840.

Upvotes: 1
