Jonah

Reputation: 21

R multiway split trees using ctree {partykit}

I want to analyze my data with conditional inference trees using the ctree function from partykit. I specifically chose this function because, if I understood correctly, it is one of the few that allow multiway splits. I need this option because all of my variables are multilevel (unordered) categorical variables.

However, trying to enable multiway splits using ctree_control gives the following error:

aufprallentree <- ctree(case ~., data = aufprallen,
  control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
## Error in 1:levels(x) : NA/NaN argument
## In addition: Warning messages:
## 1: In 1:levels(x) :
##   numerical expression has 4 elements: only the first used
## 2: In partysplit(as.integer(isel), index = 1:levels(x)) :
##   NAs introduced by coercion

Does anyone know how to solve this? Or am I mistaken, and ctree does not allow multiway splits?

For clarity, here is an overview of my data (no NAs):

str(aufprallen)
## 'data.frame':    299 obs. of  10 variables:
##  $ prep          : Factor w/ 6 levels "an","auf","hinter",..: 2 2 2 2 2 2 1 2 2 2 ...
##  $ prep_main     : Factor w/ 2 levels "auf","other": 1 1 1 1 1 1 2 1 1 1 ...
##  $ case          : Factor w/ 2 levels "acc","dat": 1 1 2 1 1 1 2 1 1 1 ...
##  $ sense         : Factor w/ 3 levels "crashdown","crashinto",..: 2 2 1 3 2 2 1 2 1 2 ...
##  $ PO_type       : Factor w/ 4 levels "object","region",..: 4 4 3 1 4 4 3 4 3 4 ...
##  $ PO_type2      : Factor w/ 3 levels "object","region",..: 1 1 3 1 1 1 3 1 3 1 ...
##  $ perfectivity  : Factor w/ 2 levels "imperfective",..: 1 1 2 2 1 1 1 1 1 1 ...
##  $ mit_Körperteil: Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PP_place      : Factor w/ 4 levels "back","front",..: 4 1 1 1 1 1 1 1 1 1 ...
##  $ PP_place_main : Factor w/ 3 levels "marked","rel",..: 2 3 3 3 3 3 3 3 3 3 ...

Thanks in advance!

Upvotes: 2

Views: 3515

Answers (1)

Achim Zeileis

Reputation: 17183

A couple of remarks:

  • The error with 1:levels(x) was a bug in ctree. The code should have been 1:nlevels(x), because levels(x) returns the character vector of level labels whereas nlevels(x) returns their number. I just fixed this on R-Forge, so you can check out the SVN from there and manually install the package if you want to use the option now. (Contact me off-list if you need more details on this.) Torsten will probably also make a new CRAN release in the next few weeks.

  • Another function that can learn binary classification trees with multiway splits is glmtree in the partykit package. The code would be glmtree(case ~ ., data = aufprallen, family = binomial, catsplit = "multiway", minsize = 5). To determine the splitting variables, it uses parameter instability tests instead of conditional inference for association, and it adopts the formal likelihood framework. But in many cases the results are fairly similar to ctree.

  • In both algorithms, the multiway splits are very basic: if a categorical variable is selected for splitting, then no split selection is done at all. Instead, every category gets its own daughter node. Other algorithms try to determine optimal groupings of categories with a data-driven number of daughter nodes (between 2 and the number of categories).

  • Even though you have categorical predictor variables with more than two levels, you don't need multiway splits. Many algorithms just use binary splits because any multiway split can be represented by a sequence of binary splits. In many datasets, however, it turns out to be beneficial to separate only a few of the categories of a splitting factor rather than all of them.
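The 1:levels(x) bug from the first remark can be reproduced in base R without partykit. The following is a minimal sketch; the factor x is a made-up stand-in for one of the predictors, and seq_len(nlevels(x)) is used as an equivalent of 1:nlevels(x).

```r
# Hypothetical 4-level factor standing in for a predictor such as PO_type
x <- factor(c("an", "auf", "hinter", "in"))

levels(x)    # character vector of level labels
nlevels(x)   # number of levels, here 4

# Buggy variant: ':' tries to coerce the first label to a number,
# gets NA and stops with "NA/NaN argument"
buggy <- try(suppressWarnings(1:levels(x)), silent = TRUE)
inherits(buggy, "try-error")   # TRUE

# Fixed variant: one index per level
fixed <- seq_len(nlevels(x))
fixed                          # 1 2 3 4
```

The two warnings in the question ("only the first used" and "NAs introduced by coercion") come from the same coercion step and are suppressed here to isolate the error.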

Overall, my recommendation would be to start out with standard conditional inference trees with binary splits only. Only if this leads to many binary splits in the same factor would I go on to explore multiway splits.
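A rough sketch of that workflow, assuming a partykit version in which the multiway fix is available. The data frame d is simulated purely for illustration (it is not the aufprallen data, and the effect sizes are invented); only the column names are borrowed from the question.

```r
library(partykit)

# Simulated stand-in data: NOT the original aufprallen data
set.seed(1)
n <- 300
d <- data.frame(
  prep = factor(sample(c("an", "auf", "hinter", "in"), n, replace = TRUE)),
  perfectivity = factor(sample(c("imperfective", "perfective"), n, replace = TRUE))
)
# Invented relationship: the probability of "acc" depends on prep only
p <- ifelse(d$prep == "auf", 0.85, ifelse(d$prep == "an", 0.5, 0.2))
d$case <- factor(ifelse(runif(n) < p, "acc", "dat"))

# Step 1: standard conditional inference tree with binary splits
ct_bin <- ctree(case ~ ., data = d,
  control = ctree_control(minsplit = 10, minbucket = 5))
print(ct_bin)

# Step 2: only if the same factor is split repeatedly, try multiway splits
ct_multi <- ctree(case ~ ., data = d,
  control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
print(ct_multi)
```

Comparing the two printed trees shows whether the binary tree already groups the categories sensibly or keeps splitting the same factor.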

Upvotes: 4
