Peter Chen

Reputation: 1484

R Catboost to handle categorical variables

I have a question about CatBoost: do I need to preprocess the categorical variables before modeling?

I have 86 variables, including 1 target variable. Of the 85 remaining variables, 2 are numeric and 83 are categorical (factor type). The target variable is a binary factor, 1 or 0.

Column 1 and Columns 4 to 85 are of factor type.
Columns 2 and 3 are numeric.

I am a little confused about cat_features in catboost.train(). In the parameters, I can set a vector of categorical features, and I can also set it in catboost.load_pool().

library(catboost)
library(dplyr)

X_train <- train %>% select(-Target)
y_train <- (as.numeric(unlist(train[c('Target')])) - 1)
X_valid <- test %>% select(-Target)
y_valid <- (as.numeric(unlist(test[c('Target')])) - 1)

train_pool <- catboost.load_pool(data = X_train, label = y_train, cat_features = c(0,3:84))
test_pool <- catboost.load_pool(data = X_valid, label = y_valid, cat_features = c(0,3:84))

params <- list(iterations=500,
               learning_rate=0.01,
               depth=10,
               loss_function='RMSE',
               eval_metric='RMSE',
               random_seed = 1,
               od_type='Iter',
               metric_period = 50,
               od_wait=20,
               use_best_model=TRUE,
               cat_features = c(0,3:84))

catboost.train(train_pool, test_pool, params = params)

However, after I ran the code above, I got an error:

Error in catboost.train(train_pool, test_pool, params = params) : 
  catboost/libs/options/plain_options_helper.cpp:339: Unknown option {cat_features} with value "[0,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]"

Any help?

Upvotes: 2

Views: 867

Answers (2)

Rafael Díaz

Reputation: 2289

Look at this example: cat_features should not go in params <- list(), only in catboost.load_pool().

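library(catboost)
library(dplyr)  # provides glimpse()

countries = c('RUS','USA','SUI')
years = c(1900,1896,1896)
phone_codes = c(7,1,41)
domains = c('ru','us','ch')

# Strings become factors, which is what marks these columns as categorical
dataset = data.frame(countries, years, phone_codes, domains, stringsAsFactors = TRUE)
glimpse(dataset)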

label_values = c(0,1,1)

fit_params <- list(iterations = 100,
                   loss_function = 'Logloss',
                   ignored_features = c(4,9),
                   border_count = 32,
                   depth = 5,
                   learning_rate = 0.03,
                   l2_leaf_reg = 3.5)

pool = catboost.load_pool(dataset, label = label_values, cat_features = c(0,3))
model <- catboost.train(pool, params = fit_params)
model
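
Applied to the layout in the question, the same idea might look like the sketch below. The data frame here is a small hypothetical stand-in (two factor columns and two numeric columns), and 'Logloss' is assumed as the loss because the target is binary. The CatBoost calls are guarded so the snippet still runs where the package is not installed.

```r
# Hypothetical stand-in for the asker's data: factor, numeric, numeric, factor.
set.seed(1)
n <- 200
X_train <- data.frame(
  f1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x2 = rnorm(n),
  x3 = rnorm(n),
  f4 = factor(sample(c("u", "v"), n, replace = TRUE))
)
y_train <- rbinom(n, 1, 0.5)

# Zero-based positions of the factor columns (columns 1 and 4).
cat_idx <- c(0, 3)

# Training parameters: there is deliberately no cat_features entry here.
params <- list(iterations = 50,
               learning_rate = 0.05,
               depth = 4,
               loss_function = "Logloss",  # assumption: binary target
               random_seed = 1)

if (requireNamespace("catboost", quietly = TRUE)) {
  # Categorical columns are declared once, when the pool is built.
  train_pool <- catboost::catboost.load_pool(data = X_train,
                                             label = y_train,
                                             cat_features = cat_idx)
  model <- catboost::catboost.train(train_pool, params = params)
}
```

For the 85-column data in the question, the only change would be the index vector, c(0, 3:84), passed to catboost.load_pool() and removed from params.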

Upvotes: 1

Clem Wang

Reputation: 739

I haven't tried CatBoost in R, but see the example on this page:

https://catboost.ai/docs/concepts/r-reference_catboost-train.html

It appears you only pass the categorical variables in the load_pool() call, and NOT in the train() call.

(This works differently from the Python API, where cat_features can also be passed in the fit() call.)

A suggestion: group all the categorical variables in the leftmost columns. That makes the index vector simpler to create. I also have a check in my code to make sure I did it right...
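
A minimal sketch of that suggestion in base R, using a hypothetical data frame (the column names are made up for illustration): move the factor columns to the left, build the zero-based index vector, and check that it points at factors only.

```r
# Hypothetical data frame with mixed column types.
df <- data.frame(
  region = factor(c("n", "s", "n")),
  age    = c(31, 45, 27),
  income = c(4.2, 5.1, 3.9),
  plan   = factor(c("x", "y", "x"))
)

# Reorder so that all factor columns come first.
is_cat <- vapply(df, is.factor, logical(1))
df <- df[, c(which(is_cat), which(!is_cat)), drop = FALSE]

# Zero-based indices of the categorical columns: now a simple range.
cat_idx <- seq_len(sum(is_cat)) - 1

# Sanity checks: every indexed column is a factor, and no other column is.
stopifnot(all(vapply(df[, cat_idx + 1, drop = FALSE], is.factor, logical(1))))
stopifnot(!any(vapply(df[, -(cat_idx + 1), drop = FALSE], is.factor, logical(1))))
```

The resulting cat_idx is what would be passed as cat_features to catboost.load_pool(). The same reordering has to be applied to the validation data so that both pools share one column layout.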

Upvotes: 0
