Reputation: 688
When creating a specification and fitting a decision tree with the tidymodels metapackage and the decision_tree() function, the default splitting rule that the rpart package uses for categorical data is the Gini index, which is set through the parms argument of rpart::rpart().
Creating a random forest model with the ranger engine uses the same default for categorical data. My question is: how can I change the splitting rule to information gain or Shannon entropy?
Here is an example (look at the str() calls and the formas_forest_fit object to see the split rules):
# install.packages(c("tidymodels", "rpart", "ranger"))
library(tidymodels)
formas <- tibble(
  Color = c("Rojo", "Azul", "Rojo", "Verde", "Rojo", "Verde"),
  Forma = c("Cuadrado", "Cuadrado", "Redondo", "Cuadrado", "Redondo", "Cuadrado"),
  `Tamaño` = c("Grande", "Grande", "Pequeño", "Pequeño", "Grande", "Grande"),
  Compra = structure(c(2L, 2L, 1L, 1L, 2L, 1L), .Label = c("No", "Si"), class = "factor")
)
# Tree spec and fit -----------------------
formas_tree_spec <-
  decision_tree(min_n = 2) %>%
  set_mode("classification") %>%
  set_engine("rpart")
formas_tree_fit <-
  fit(
    formas_tree_spec,
    data = formas,
    formula = Compra ~ .
  )
# Forest spec and fit ----------------------
formas_forest_spec <-
  rand_forest(trees = 5000, min_n = 2) %>%
  set_mode("classification") %>%
  set_engine("ranger")
formas_forest_fit <-
  fit(
    formas_forest_spec,
    data = formas,
    formula = Compra ~ .
  )
str(rpart::rpart)
str(ranger::ranger)
formas_forest_fit
Upvotes: 0
Views: 229
Reputation: 688
Following Emil Hvidfeldt's suggestion, the set_engine() function lets us pass arguments directly to the engine function.
This is the tree specification with the information gain splitting rule:
formas_tree_spec <-
  decision_tree(min_n = 2) %>%
  set_mode("classification") %>%
  # parms is forwarded as-is to rpart::rpart()
  set_engine("rpart", parms = list(split = "information"))
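To check what that engine argument does without going through tidymodels, here is a minimal sketch that calls rpart::rpart() directly with the same parms value. The object names (formas_df, ctrl, tree_gini, tree_info) are made up for illustration, the Tamano column is spelled without the tilde only to keep the column name ASCII, and minsplit = 2 with xval = 0 are my own choices so that six rows can be split at all without cross-validation:

```r
library(rpart)  # recommended package, shipped with standard R installations

# Same toy data as in the question, rebuilt as a base data.frame.
# "Tamano" is a hypothetical ASCII spelling of the original column name.
formas_df <- data.frame(
  Color  = c("Rojo", "Azul", "Rojo", "Verde", "Rojo", "Verde"),
  Forma  = c("Cuadrado", "Cuadrado", "Redondo", "Cuadrado", "Redondo", "Cuadrado"),
  Tamano = c("Grande", "Grande", "Pequeño", "Pequeño", "Grande", "Grande"),
  Compra = factor(c("Si", "Si", "No", "No", "Si", "No"), levels = c("No", "Si")),
  stringsAsFactors = TRUE
)

# Allow splits on very small nodes and skip cross-validation (assumed settings).
ctrl <- rpart.control(minsplit = 2, xval = 0)

# Default splitting rule for classification (Gini index).
tree_gini <- rpart(Compra ~ ., data = formas_df, method = "class",
                   control = ctrl)

# Same model, but asking for the information (entropy-based) rule.
tree_info <- rpart(Compra ~ ., data = formas_df, method = "class",
                   parms = list(split = "information"), control = ctrl)

# The stored call shows which parms were actually passed to rpart().
tree_info$call
print(tree_info)
```

With a parsnip fit, the underlying rpart object should be available as formas_tree_fit$fit, so the same inspection applies there. As far as I know, ranger's splitrule argument for classification only offers "gini", "extratrees" and "hellinger", so there does not seem to be an entropy option for the ranger engine.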
Upvotes: 1