Reputation: 125
I am building a random forest in R using the randomForest package. All my features are categorical. For example, my feature "voting method in 2020 general election" has the responses {"", "AB", "AP", "MB", "P"}. I would like to know whether my trees are generally splitting between the empty string and the other responses (which would indicate that the voting method matters less than whether a vote was recorded at all).
I have been examining forest$xbestsplit, which seems to contain what I need, but I'm not sure how to interpret it. It gives me a column for each of my 500 trees, and each column has some number of rows. I'm not sure whether the rows represent nodes, or how to interpret the given numeric values, since my responses are categorical.
I built a forest with just one feature as an example:
library(randomForest)

# Toy data set with a single categorical feature.
mini_data <- data.frame(vote_method = c('MB', 'MB', '', 'MB', 'MB', 'MB', 'AP', '', 'AP', 'MB'),
                        target = c(1, 1, 0, 1, 1, 0, 0, 0, 1, 1))
mini_data$vote_method <- as.factor(mini_data$vote_method)
mini_data$target <- as.factor(mini_data$target)

forest <- randomForest(target ~ ., data = mini_data)
forest$forest$xbestsplit[, 1]
I think these should be the split values for all the nodes in the first tree. The output was: 1 0 0 0 0
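As a sanity check, I also printed the first tree with getTree, which (if I read the docs right) shows the same split points alongside the node structure:

# Same tree in tabular form; the 'split point' column should match
# xbestsplit[, 1], and status -1 marks terminal nodes.
getTree(forest, k = 1, labelVar = TRUE)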
The randomForest documentation (for the function getTree) has this note on split points for categorical variables:
For categorical predictors, the splitting point is represented by an integer, whose binary expansion gives the identities of the categories that go to the left. For example, if a predictor has four categories and the split point is 13, the binary expansion of 13 is (1, 0, 1, 1) (because 13 = 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3), so cases with categories 1, 3, or 4 in this predictor get sent to the left, and the rest to the right.
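To make sure I understand that encoding, here is a small helper I sketched that decodes a split point into the categories sent left (decode_split is my own name; it assumes categories are numbered 1..n in factor-level order, as the docs describe):

# Decode an integer split point into the category labels that go left.
# Bit i (counting from the least significant bit) corresponds to category i.
decode_split <- function(split_point, category_levels) {
  goes_left <- bitwAnd(split_point, 2^(seq_along(category_levels) - 1)) != 0
  category_levels[goes_left]
}

# The documentation's example: four categories, split point 13.
decode_split(13, c("cat1", "cat2", "cat3", "cat4"))
# "cat1" "cat3" "cat4" -- categories 1, 3, and 4 go left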
Under this explanation, what do 0 and 1 mean?
Upvotes: 2
Views: 704
Reputation: 10627
You want to verify the hypothesis that it is more important to have any vote recorded (yes/no) than the actual value of the vote method. You can look at feature importance, e.g. with varImpPlot(forest), which is based on all the splits. Keep in mind that this Gini feature importance is not additive. Moreover, the levels other than "" are likely selected more often during fitting simply because there are so many of them. Therefore, I would advise fitting two models instead: one with a yes/no feature and another with all the individual values. Then you can see whether the predictive power, measured e.g. by accuracy, sensitivity, or specificity, is higher in the model with only the one yes/no feature. A model is used to make predictions, and the internal splitting structure is only indirectly relevant for that purpose. Performance is a more direct way to compare models.
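A minimal sketch of that two-model comparison, assuming your full data is in a data frame df with the factor columns vote_method and target (names are illustrative):

library(randomForest)

# Collapse the vote-method levels into a single yes/no feature:
# "" means no vote recorded, anything else means a vote was recorded.
df$voted <- factor(ifelse(df$vote_method == "", "no", "yes"))

# Model 1: only whether a vote was recorded.
forest_voted <- randomForest(target ~ voted, data = df)

# Model 2: the individual vote-method values.
forest_method <- randomForest(target ~ vote_method, data = df)

# Compare out-of-bag performance, e.g. via the confusion matrices
# (the last column is the per-class error rate).
forest_voted$confusion
forest_method$confusion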
Upvotes: 0