user10360304

Reputation: 171

XGBoost Error in R Studio ("'data' has class 'character' and length...")

I am having difficulties fitting my data to an xgboost classifier model. When I run this:

classifier = xgboost(data = as.matrix(training_set[c(4:15, 17:18, 20:28)]),
                     label = training_set$posted_ind, nrounds = 10)

R Studio tells me:

Error in xgb.DMatrix(data, label = label, missing = missing) : 
'data' has class 'character' and length 1472000.
'data' accepts either a numeric matrix or a single filename. 

The training set contains both continuous and categorical variables, but every categorical variable has been encoded as such, and the same data fit random forest and naive Bayes models without this error. Is there some additional step I need to complete before I can use these data in an xgboost model?

Upvotes: 7

Views: 9721

Answers (3)

Haariss

Reputation: 21

What works for me with tidymodels is adding a recipe step for dummy encoding:

step_dummy(all_nominal_predictors(), one_hot = TRUE)
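
For context, here is a minimal sketch of where that step sits in a recipe. The data frame `toy` and its column names are made up for illustration, and only `posted_ind` is borrowed from the question; the sketch assumes the recipes package is installed.

```r
library(recipes)

# Hypothetical toy data; the column names are made up for illustration.
toy <- data.frame(
  plan = factor(c("a", "b", "a", "c")),
  age = c(23, 35, 41, 52),
  posted_ind = c(0, 1, 0, 1)
)

# Dummy-encode every nominal predictor; one_hot = TRUE keeps one column per level.
rec <- recipe(posted_ind ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

# prep() estimates the step; bake(new_data = NULL) returns the processed training data.
baked <- bake(prep(rec), new_data = NULL)

# All predictors are numeric now, so as.matrix() yields a numeric matrix.
x <- as.matrix(baked[setdiff(names(baked), "posted_ind")])
print(colnames(x))
```

The resulting `x` can be passed as `data` to xgboost, with `toy$posted_ind` as the label.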

Upvotes: 0

user2205916

Reputation: 3456

I came across the same problem and found a complete solution. You have to use:

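library(Matrix)   # provides sparse.model.matrix()
library(xgboost)  # provides xgb.DMatrix()
sparse_matrix <- sparse.model.matrix(label_y ~ ., data = df)[, -1]  # [, -1] drops the intercept column
X_train_dmat <- xgb.DMatrix(sparse_matrix, label = df$label_y)      # same label column as in the formula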

This transforms the categorical data into dummy variables. Several encoding methods exist; one-hot encoding is a common one. The code above uses dummy contrast coding, which is popular because it produces a “full rank” encoding (also see this blog post by Max Kuhn).

The purpose is to transform each value of each categorical feature into a binary feature {0, 1}.

For example, a column Treatment will be replaced by two columns, TreatmentPlacebo, and TreatmentTreated. Each of them will be binary. Therefore, an observation which has the value Placebo in column Treatment before the transformation will have after the transformation the value 1 in the new column TreatmentPlacebo and the value 0 in the new column TreatmentTreated. The column TreatmentPlacebo will disappear during the contrast encoding, as it would be absorbed into a common constant intercept column.

Source: https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html#conversion-from-categorical-to-numeric-variables
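
To make the Treatment example concrete, here is a small sketch using the Matrix package. The data frame and its values are invented for illustration; only the column name `Treatment` and the level names come from the text above.

```r
library(Matrix)

# Hypothetical toy data mirroring the Treatment example.
df <- data.frame(
  Treatment = factor(c("Placebo", "Treated", "Treated", "Placebo")),
  Age = c(30, 45, 52, 38),
  label_y = c(0, 1, 1, 0)
)

# Contrast coding: the first factor level (Placebo) is absorbed into the
# intercept, and [, -1] then drops the intercept column itself.
sparse_matrix <- sparse.model.matrix(label_y ~ ., data = df)[, -1]

print(colnames(sparse_matrix))
print(as.matrix(sparse_matrix))
```

Only `TreatmentTreated` and `Age` remain as columns; a row with `Treatment == "Placebo"` is represented by a 0 in `TreatmentTreated`.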

Upvotes: 3

user11442360

Reputation: 81

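Make sure that your training_set does not have any columns that are factors. If you encoded your categorical variables as numbers but then cast them to factors, you will get this error: as.matrix() turns a data frame that contains a factor (or character) column into a character matrix, which xgboost rejects.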
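
A minimal base R sketch of the factor issue described in this answer. The data frame is a made-up stand-in for the asker's training_set (only the name `posted_ind` comes from the question), and converting factors back to numbers is one possible fix, assuming the factor levels really are numeric codes.

```r
# Hypothetical toy data; "age" and "plan" are made-up column names.
training_set <- data.frame(
  age = c(23, 35, 41, 52),
  plan = factor(c("1", "2", "1", "3")),  # numeric codes stored as a factor
  posted_ind = c(0, 1, 0, 1)
)

# A single factor column makes as.matrix() return a character matrix.
m_bad <- as.matrix(training_set[c("age", "plan")])
print(is.character(m_bad))

# Convert factor columns back to numbers before building the matrix.
is_fac <- vapply(training_set, is.factor, logical(1))
training_set[is_fac] <- lapply(
  training_set[is_fac],
  function(f) as.numeric(as.character(f))  # go via character to keep the original codes
)

m_ok <- as.matrix(training_set[c("age", "plan")])
print(is.numeric(m_ok))
```

Running `sapply(training_set, class)` on the real data is a quick way to spot which columns are still factors or characters. Note that integer codes impose an ordering on the categories, so dummy encoding may be the better choice for unordered variables.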

Upvotes: 8
