Reputation: 41
I'm working on a dataset containing a list of people, indexed by fiscal code. The target variable is binary (1: buys a book, 0: otherwise). All the predictors are categorical (e.g. nationality, city, road, income bin, and so on). A fiscal code can appear twice, and each instance/observation has a weight (1 if not repeated, a value between 0 and 1 if repeated).
For example, the dataset looks like
fiscal_code | weight | target | categorical info
AAAAA1      | 0.98   | 0      | ...
AAAAA1      | 0.02   | 1      | ...
I have two datasets with the same variables: one for training (X_train, the matrix of categorical variables; y_train, the target variable; train_weight, the weight of each observation in the training set) and one for testing (X_test, y_test and test_weight, with the same meaning).
I'm trying a CatBoost model, CatBoostClassifier:
import numpy as np
from catboost import CatBoostClassifier

# np.category does not exist; compare against the 'category' dtype name instead
categorical_features_indices = np.where(X.dtypes == 'category')[0]

model = CatBoostClassifier(iterations=5000, learning_rate=0.1, depth=7,
                           loss_function='Logloss', eval_metric='AUC')
model.fit(X_train,
          y_train,
          eval_set=(X_test, y_test),
          cat_features=categorical_features_indices,
          use_best_model=True,
          verbose=True,
          sample_weight=train_weight)
The question is: how can I take into account that the observations in the TEST set have weights too (test_weight)? Do you have any idea?
I read the documentation at https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostregressor_fit-docpage/ but did not find anything useful, unlike the LightGBM documentation (if considering another boosting model).
Upvotes: 2
Views: 4507
Reputation: 2871
My understanding is this is a case where you need to use a Pool, i.e.
model.fit(Pool(X_train, y_train, weight=train_weight,
               cat_features=categorical_features_indices),
          eval_set=Pool(X_test, y_test, weight=test_weight,
                        cat_features=categorical_features_indices),
          use_best_model=True,
          verbose=True)

Note that when X is a Pool, cat_features must be passed to the Pool constructor rather than to fit (the original snippet was also missing a comma after the first Pool and passed cat_features to fit).
Upvotes: 0