stats_noob

Reputation: 5925

Categorical Variable has a Limit of 53 Values

I am using the R programming language. I am trying to fit a "Random Forest" (a statistical model) to my data, but one of my categorical variables has more than 53 categories. Apparently the "randomForest" package in R does not permit a categorical predictor with more than 53 categories, and this is preventing me from using this variable in my model. Ideally, I would like to use this variable.

To illustrate this example, I created a data set (called "data") where one of the variables has more than 53 categories:

#load libraries
library(caret)
library(randomForest)
library(ranger)

#first data set

cat_var <- c(letters, paste0(letters, letters), "aaa", "bbb")  # "a".."z", "aa".."zz", "aaa", "bbb": 54 unique values

var_1 <- rnorm(54,10,10)
var_2 <- rnorm(54, 5, 5)
var_3 <- rnorm(54, 6,18)

response <- c("a","b")
response <- sample(response, 54, replace=TRUE, prob=c(0.3, 0.7))

data_1 = data.frame(cat_var, var_1, var_2, var_3, response)
data_1$response = as.factor(data_1$response)
data_1$cat_var = as.factor(data_1$cat_var)

#second data set


cat_var <- c(letters, paste0(letters, letters), "aaa", "bbb")  # same 54 unique values

var_1 <- rnorm(54,10,10)
var_2 <- rnorm(54, 5, 5)
var_3 <- rnorm(54, 6,18)

response <- c("a","b")
response <- sample(response, 54, replace=TRUE, prob=c(0.3, 0.7))

data_2 = data.frame(cat_var, var_1, var_2, var_3, response)
data_2$response = as.factor(data_2$response)
data_2$cat_var = as.factor(data_2$cat_var)

# third data set


cat_var <- c(letters, paste0(letters, letters), "aaa", "bbb")  # same 54 unique values

var_1 <- rnorm(54,10,10)
var_2 <- rnorm(54, 5, 5)
var_3 <- rnorm(54, 6,18)

response <- c("a","b")
response <- sample(response, 54, replace=TRUE, prob=c(0.3, 0.7))

data_3 = data.frame(cat_var, var_1, var_2, var_3, response)
data_3$response = as.factor(data_3$response)
data_3$cat_var = as.factor(data_3$cat_var)

#combine data sets

data = rbind(data_1, data_2, data_3)
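
Before fitting anything, it is worth confirming that cat_var really crosses the 53-level threshold after the rbind (a quick sanity check I added for illustration, not part of the original post):

```r
# rbind unions the factor levels across the three data frames
nlevels(data$cat_var)  # 54 levels -- one more than randomForest allows
nrow(data)             # 162 rows (3 x 54)
```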

From here, I am interested in fitting the random forest model. I looked at different Stack Overflow posts (e.g. "R randomForest too many categories error even with fewer than 53 categories", "R - Random Forest and more than 53 categories"), and here is what I noticed.

Here is what happens if you try to fit the random forest model as is:

#random forest using the "randomForest" library

rf = randomForest(response ~ var_1 + var_2 + var_3 + cat_var, data=data, ntree=50, mtry=2)

Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.

In one of these posts, a user suggested using the "caret" library to fit the model - apparently models fitted through caret do not have the 53-category limitation. This works, but I am not sure if it is correct:

#random forest using the "caret" and "ranger" libraries: (are these correct?)

random_forest <- train(response ~., 
                 data = data, 
                 method = 'ranger')

random_forest <- train(response ~., 
                 data = data, 
                 method = 'rf')

Finally, another user suggested using the "model matrix" approach, but I am not sure if I understood this approach altogether:

#model matrix method


# expand cat_var into 0/1 indicator columns; setting contrasts = FALSE keeps all 54 levels
# (contrasts only apply to factors, so the numeric variables pass through unchanged)
dummyMat <- model.matrix(response ~ var_1 + var_2 + var_3 + cat_var, data,
                         contrasts.arg = list(cat_var = contrasts(data$cat_var, contrasts = FALSE)))
data2 <- data.frame(response = data$response, dummyMat[, -1]) # just removing the intercept column

rf <- randomForest(response ~ ., data = data2, ntree = 50, mtry = 2)
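
As an aside, the same one-hot encoding can be sketched with caret's dummyVars helper (caret is already loaded above). This is my own sketch, not something from the linked posts, assuming you want every level kept as its own 0/1 column:

```r
# fullRank = FALSE keeps one indicator column per level of cat_var
dv <- dummyVars(~ var_1 + var_2 + var_3 + cat_var, data = data, fullRank = FALSE)
data3 <- data.frame(response = data$response, predict(dv, newdata = data))

# cat_var itself is gone, so randomForest no longer sees a 54-level factor
rf2 <- randomForest(response ~ ., data = data3, ntree = 50, mtry = 2)
```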

Can someone please suggest how I can solve this problem? Is the second approach (using "caret") correct?

Thanks

Upvotes: 0

Views: 743

Answers (1)

Elia

Reputation: 2584

I can tell you that the caret approach is correct. caret contains tools for data splitting, preprocessing, feature selection and model tuning with resampling (e.g. cross-validation). Here I post a typical workflow for fitting a model with the caret package, using the data you posted.

First, we set a cross-validation method for tuning the hyperparameters of the chosen model (in your case the tuning parameters are mtry for both ranger and randomForest, plus splitrule and min.node.size for ranger). In this example, I choose k-fold cross-validation with k=10:

library(caret)
control <- trainControl(method="cv",number = 10)

Then we create a grid with the candidate values for the parameters to be tuned:

rangergrid <- expand.grid(mtry = 2:(ncol(data)-1),
                          splitrule = "extratrees",
                          min.node.size = 1:10) # min.node.size counts observations, so use integers
rfgrid <- expand.grid(mtry = 2:(ncol(data)-1))
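
Each row of these grids is one hyperparameter combination caret will evaluate, and with 10-fold CV every row costs ten model fits. A quick way to see how large the search is:

```r
nrow(rangergrid)  # combinations of mtry x splitrule x min.node.size
nrow(rfgrid)      # rf tunes only mtry
```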

finally, we fit the chosen models:

random_forest_ranger <- train(response ~., 
                       data = data, 
                       method = 'ranger',
                       trControl=control,
                       tuneGrid=rangergrid)


random_forest_rf <- train(response ~., 
                       data = data, 
                       method = 'rf',
                       trControl=control,
                       tuneGrid=rfgrid)

The output of the train function looks like this:

> random_forest_rf
Random Forest 

162 samples
  4 predictor
  2 classes: 'a', 'b' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 146, 146, 146, 145, 146, 146, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      
  2     0.6852941   0.00000000
  3     0.6852941   0.00000000
  4     0.6602941  -0.04499494

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
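
To use the fitted train objects afterwards, a minimal sketch (assuming the models above finished training):

```r
# winning hyperparameter combination chosen by resampling
random_forest_rf$bestTune

# predictions from the final model, refit on the full data
preds <- predict(random_forest_rf, newdata = data)
confusionMatrix(preds, data$response)
```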

For more info on the caret package, look at the online vignette.

Upvotes: 1
