Reputation: 171
I am having difficulties fitting my data to an xgboost classifier model. When I run this:
classifier = xgboost(data = as.matrix(training_set[c(4:15, 17:18,20:28)]),
label = training_set$posted_ind, nrounds = 10)
RStudio tells me:
Error in xgb.DMatrix(data, label = label, missing = missing) :
'data' has class 'character' and length 1472000.
'data' accepts either a numeric matrix or a single filename.
The training set data has both continuous and categorical data, but all categorical data has been encoded as such (and the same data fit to random forest and naive bayes models). Is there some additional step I need to complete so that I can use these data in an xgboost model?
Upvotes: 7
Views: 9721
Reputation: 21
What is working for me while using tidymodels is adding a recipe step for dummy encoding:
step_dummy(all_nominal_predictors(), one_hot = TRUE)
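For context, here is a minimal sketch of how that step might sit in a full recipe. The data frame and its column names (df, posted_ind, Age, Treatment) are made up for illustration, and I'm assuming a recipes version recent enough to have all_nominal_predictors():

```r
library(recipes)

# Hypothetical toy data: a factor outcome, one numeric and one categorical predictor.
df <- data.frame(
  posted_ind = factor(c("no", "yes", "no", "yes")),
  Age        = c(34, 51, 29, 62),
  Treatment  = factor(c("Placebo", "Treated", "Placebo", "Treated"))
)

# Declare the outcome and predictors, then one-hot encode every nominal predictor.
rec <- recipe(posted_ind ~ ., data = df)
rec <- step_dummy(rec, all_nominal_predictors(), one_hot = TRUE)

# prep() estimates the step on the training data; bake(new_data = NULL) returns
# the processed training set.
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

After baking, the Treatment factor should be replaced by one numeric indicator column per level, which is the kind of input xgboost can work with.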
Upvotes: 0
Reputation: 3456
I came across the same problem and found a complete solution. You have to use:
sparse_matrix <- sparse.model.matrix(label ~ ., data = df)[, -1]
X_train_dmat <- xgb.DMatrix(sparse_matrix, label = df$label)
This transforms the categorical data to dummy variables. Several encoding methods exist; one-hot encoding is a common one. The approach above is dummy (contrast) coding, which is popular because it produces a “full rank” encoding (also see this blog post by Max Kuhn).
The purpose is to transform each value of each categorical feature into a binary feature {0, 1}.
For example, a column Treatment will be replaced by two columns, TreatmentPlacebo and TreatmentTreated. Each of them will be binary. Therefore, an observation that has the value Placebo in column Treatment before the transformation will, after the transformation, have the value 1 in the new column TreatmentPlacebo and the value 0 in the new column TreatmentTreated. The column TreatmentPlacebo will disappear during the contrast encoding, as it would be absorbed into a common constant intercept column.
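To make the Treatment example concrete, here is a small sketch on made-up data (the data frame df and its columns label, Age and Treatment are hypothetical). It only needs the Matrix package, which is where sparse.model.matrix() lives:

```r
library(Matrix)

# Hypothetical toy data: a 0/1 label, one numeric feature, one two-level factor.
df <- data.frame(
  label     = c(0, 1, 0, 1),
  Age       = c(34, 51, 29, 62),
  Treatment = factor(c("Placebo", "Treated", "Placebo", "Treated"))
)

# The first column of the model matrix is the intercept; [, -1] drops it.
sparse_matrix <- sparse.model.matrix(label ~ ., data = df)[, -1]

# Placebo is the reference level, so only TreatmentTreated is kept.
colnames(sparse_matrix)
# [1] "Age"              "TreatmentTreated"
```

The result is a numeric sparse matrix, so it can be handed to xgb.DMatrix() together with df$label.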
Upvotes: 3
Reputation: 81
Make sure that your "training_set" does not have any columns that are factors. If you encoded your categorical variables as numeric but cast them as factors, you will get this error, because as.matrix() returns a character matrix as soon as any column of the data frame is a factor.
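A quick way to check this, sketched on a made-up data frame (the columns x1 and x2 are hypothetical, and the conversion below assumes the factor levels are numeric strings such as "0" and "1"):

```r
# Hypothetical toy data: x2 holds numeric codes but is stored as a factor.
training_set <- data.frame(
  x1         = c(1.5, 2.0, 3.2),
  x2         = factor(c("0", "1", "0")),
  posted_ind = c(0, 1, 0)
)

# One factor column is enough to turn the whole matrix into character.
m_bad <- as.matrix(training_set[c("x1", "x2")])
typeof(m_bad)
# [1] "character"

# Find the factor columns and convert them back to numeric.
# as.numeric(as.character(f)) keeps the original values; calling as.numeric(f)
# directly would return the internal level codes instead.
fixed  <- training_set
is_fac <- sapply(fixed, is.factor)
fixed[is_fac] <- lapply(fixed[is_fac], function(f) as.numeric(as.character(f)))

m_ok <- as.matrix(fixed[c("x1", "x2")])
is.numeric(m_ok)
# [1] TRUE
```

If the factor levels are real category names rather than numbers, this conversion would produce NAs, and dummy encoding is the better route.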
Upvotes: 8