Reputation: 21
I want to analyze my data with conditional inference trees using the ctree function from partykit. I specifically chose this function because, if I understood correctly, it is one of the few that allow multiway splits. I need this option because all of my variables are multilevel (unordered) categorical variables.
However, trying to enable multiway splits using ctree_control gives the following error:
aufprallentree <- ctree(case ~., data = aufprallen,
control = ctree_control(minsplit = 10, minbucket = 5, multiway = TRUE))
## Error in 1:levels(x) : NA/NaN argument
## In addition: Warning messages:
## 1: In 1:levels(x) :
## numerical expression has 4 elements: only the first used
## 2: In partysplit(as.integer(isel), index = 1:levels(x)) :
## NAs introduced by coercion
Does anyone know how to solve this? Or am I mistaken, and ctree does not allow multiway splits?
For clarity, an overview of my data: (no NAs)
str(aufprallen)
## 'data.frame': 299 obs. of 10 variables:
## $ prep : Factor w/ 6 levels "an","auf","hinter",..: 2 2 2 2 2 2 1 2 2 2 ...
## $ prep_main : Factor w/ 2 levels "auf","other": 1 1 1 1 1 1 2 1 1 1 ...
## $ case : Factor w/ 2 levels "acc","dat": 1 1 2 1 1 1 2 1 1 1 ...
## $ sense : Factor w/ 3 levels "crashdown","crashinto",..: 2 2 1 3 2 2 1 2 1 2 ...
## $ PO_type : Factor w/ 4 levels "object","region",..: 4 4 3 1 4 4 3 4 3 4 ...
## $ PO_type2 : Factor w/ 3 levels "object","region",..: 1 1 3 1 1 1 3 1 3 1 ...
## $ perfectivity : Factor w/ 2 levels "imperfective",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ mit_Körperteil: Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 1 ...
## $ PP_place : Factor w/ 4 levels "back","front",..: 4 1 1 1 1 1 1 1 1 1 ...
## $ PP_place_main : Factor w/ 3 levels "marked","rel",..: 2 3 3 3 3 3 3 3 3 3 ...
Thanks in advance!
Upvotes: 2
Views: 3515
Reputation: 17183
A couple of remarks:
The error with 1:levels(x) was a bug in ctree. The code should have been 1:nlevels(x). I just fixed this on R-Forge, so you can check out the SVN from there and manually install the package if you want to use the option now. (Contact me off-list if you need more details on this.) Torsten will probably also make a new CRAN release in the next weeks.
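To see why the original expression fails, here is a minimal base-R sketch. The factor x and its level names are hypothetical stand-ins, not the actual data or the partykit source; the point is only that levels() returns the character labels, while nlevels() returns their count.

```r
# Hypothetical factor with four levels (names are made up for illustration).
x <- factor(c("a", "b", "c", "d"))

levels(x)    # character vector of level labels
nlevels(x)   # a single integer: the number of levels

# Buggy form: `:` receives a character vector, so it fails.
# try() captures the error; suppressWarnings() hides the coercion warnings.
buggy <- suppressWarnings(try(1:levels(x), silent = TRUE))
inherits(buggy, "try-error")

# Corrected form: one index per level.
fixed <- 1:nlevels(x)
fixed
```

The warnings quoted in the question ("only the first used", "NAs introduced by coercion") are consistent with R trying to turn the first level label into a number before giving up.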
Another function that can learn trees for binary classification with multiway splits is glmtree in the partykit package. The code would be glmtree(case ~ ., data = aufprallen, family = binomial, catsplit = "multiway", minsize = 5). It uses parameter instability tests instead of conditional inference for association to determine the splitting variables and adopts the formal likelihood framework. But in many cases the results are fairly similar to ctree.
In both algorithms, the multiway splits are very basic: if a categorical variable is selected for splitting, then no split selection is done at all. Instead, all categories get their own daughter node. There are also algorithms that try to determine optimal groupings of categories with a data-driven number of daughter nodes (between 2 and the number of categories).
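As a rough illustration of what such a basic multiway split amounts to, the following base-R sketch uses a small made-up data frame (d, y, and f are hypothetical names, not the questioner's data). It only mimics the partitioning step and is not how partykit implements it. For contrast, it also counts how many two-group partitions a binary split on the same factor would have to choose among, which is 2^(k - 1) - 1 for k unordered levels.

```r
# Hypothetical data: a binary response and one 4-level unordered factor.
d <- data.frame(
  y = factor(c("acc", "dat", "acc", "acc", "dat", "dat", "acc", "dat")),
  f = factor(c("a", "b", "c", "d", "a", "b", "c", "d"))
)

# Basic multiway split: one daughter node per level, no search over groupings.
daughters <- split(d, d$f)
length(daughters) == nlevels(d$f)

# A binary split on k unordered levels has 2^(k - 1) - 1 candidate groupings.
n_binary_groupings <- function(k) 2^(k - 1) - 1
n_binary_groupings(nlevels(d$f))
```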
Even though you have categorical predictor variables with more than two levels, you don't need multiway splits. Many algorithms just use binary splits because any multiway split can be represented by a sequence of binary splits. In many datasets, however, it turns out that it is beneficial to separate only a few of the categories in a splitting factor rather than all of them.
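The claim that a multiway split can be rebuilt from binary splits can be checked with a small base-R sketch. The factor f and the node numbering are hypothetical; the nested ifelse() calls stand in for three successive binary splits ({a} vs. rest, then {b} vs. rest, then {c} vs. {d}).

```r
# Hypothetical factor with four unordered levels.
f <- factor(c("a", "b", "c", "d", "a", "c", "d", "b"))

# Multiway split: every level gets its own daughter node.
multiway_node <- as.integer(f)

# The same partition from three nested binary splits:
# {a} vs rest, then {b} vs rest, then {c} vs {d}.
binary_node <- ifelse(f == "a", 1L,
               ifelse(f == "b", 2L,
               ifelse(f == "c", 3L, 4L)))

# Both approaches assign every observation to the same terminal node.
identical(multiway_node, binary_node)
table(f, binary_node)
```

A binary tree is free to stop after the first or second of these splits, which is how it can keep several categories together in one node.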
Overall, my recommendation would be to start out with standard conditional inference trees with binary splits only. Only if this leads to many binary splits in the same factor would I go on to explore multiway splits.
Upvotes: 4