Reputation: 9
I am trying to use a categorical predictor in an xgboost algorithm, but keep getting errors. Here are the relevant parts of my code.
import xgboost as xgb
from pandas.api.types import CategoricalDtype
from sklearn.model_selection import train_test_split

# .copy() so the column assignments below modify df itself, not a view of data
df = data[["country_name", "Timestamp", "Flow Duration", "Flow IAT Min", "Src Port", "Tot Fwd Pkts", "Init Bwd Win Byts", "Label"]].copy()
df["country_name"] = df["country_name"].astype(CategoricalDtype(ordered=True))
X = df[["country_name", "Flow Duration", "Flow IAT Min", "Src Port", "Tot Fwd Pkts", "Init Bwd Win Byts"]]
# Encode the target as 0/1
df["Label"] = df["Label"].replace(['benign', 'ddos'], [0, 1])
y = df["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model2 = xgb.XGBClassifier(tree_method="gpu_hist", enable_categorical=True, use_label_encoder=False)
model2.fit(X_train, y_train)
I also tried .astype("category"), but that didn't work either. I keep getting this error when I run the last line of code:
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When
categorical type is supplied, DMatrix parameter
`enable_categorical` must be set to `True`.country_name
Any help would be appreciated, thank you!!
Upvotes: 0
Views: 1217
Reputation: 1
You can build your DMatrix explicitly; that is where you need to enable categorical support, e.g.:
train_x, valid_x, train_y, valid_y = train_test_split(x_subfeatures, y_encoded, train_size=.75)
dtrain = xgb.DMatrix(
    train_x,
    label=train_y,
    # enable categorical data
    enable_categorical=True,
)
dvalid = xgb.DMatrix(
    valid_x,
    label=valid_y,
    enable_categorical=True,
)
Upvotes: 0
Reputation: 1
Ideally, check the .dtypes of all your relevant predictors and attach the output to the question.
In this specific case, country_name might be of object type, in which case you would need to encode this variable first.
To encode, you can choose among the encoders listed here: https://contrib.scikit-learn.org/category_encoders/
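As a rough sketch of the dtype check described above, the snippet below lists the columns that are neither numeric, bool, nor categorical, and then casts them to pandas' category dtype as one simple alternative to an external encoder. The toy frame and the variable names are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame (hypothetical) standing in for the predictors in the question.
toy = pd.DataFrame({
    "country_name": ["US", "DE", "US", "IN"],
    "Flow Duration": [120.0, 35.5, 980.0, 12.25],
    "Src Port": [443, 80, 22, 8080],
})
print(toy.dtypes)


def unsupported_columns(frame):
    """Return columns that are not numeric, bool, or categorical."""
    return [
        col for col in frame.columns
        if not (
            is_numeric_dtype(frame[col])  # also True for bool columns
            or isinstance(frame[col].dtype, pd.CategoricalDtype)
        )
    ]


offenders = unsupported_columns(toy)

# Cast each offending column to the pandas category dtype.
for col in offenders:
    toy[col] = toy[col].astype("category")

remaining = unsupported_columns(toy)
```

Running the check again after the cast should leave nothing flagged; if a column still shows up, that is the one to look at first.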
Upvotes: -1