Let me first explain the data set I am using.
I have three sets, built from time slices: the train set is the oldest data, the hold-out set is the newest data, and the eval set sits in between.
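Roughly, the split looks like this (a minimal sketch; the date column name and the cutoff dates are placeholders, not my real values):

# df is assumed to be a pandas DataFrame with one row per observation.
# 'event_date' and the cutoff dates are placeholders for illustration only.
df = df.sort_values('event_date')
train_set   = df[df['event_date'] < '2017-01-01']    # oldest slice
test_set    = df[(df['event_date'] >= '2017-01-01') &
                 (df['event_date'] < '2017-07-01')]  # middle (eval) slice
holdout_set = df[df['event_date'] >= '2017-07-01']   # newest slice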
Now I am building two models.
Model1:
from catboost import CatBoostClassifier

# Initialize CatBoostClassifier
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time=True,
    iterations=300,
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)

## Fitting catboost model; the eval set drives best-iteration selection
model.fit(
    train_set.values, Y_train.values,
    cat_features=categorical_features_indices,
    eval_set=(test_set.values, Y_test),
    # logging_level='Verbose'  # you can uncomment this for text output
)
Then I predict on the hold-out set.
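The prediction step is essentially this (holdout_set and Y_holdout are placeholder names for my hold-out slice):

from sklearn.metrics import log_loss, roc_auc_score

# Class-1 probabilities on the hold-out slice
holdout_proba = model.predict_proba(holdout_set.values)[:, 1]
print('AUC:    ', roc_auc_score(Y_holdout, holdout_proba))
print('LogLoss:', log_loss(Y_holdout, holdout_proba))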
Model2:
## Capture the best iteration found by model1 before reusing the variable name
best_iteration = model.get_best_iteration()

model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time=True,
    iterations=best_iteration,  # bestIteration from model1
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)

## Fitting catboost model (no eval set this time)
model.fit(
    train.values, Y.values,
    cat_features=categorical_features_indices,
    # logging_level='Verbose'  # you can uncomment this for text output
)
Both models are identical except for iterations: the first model runs a fixed 300 rounds but is shrunk to bestIteration (because an eval set is provided), while the second model uses that bestIteration from model1 directly.
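One quick sanity check of the shrinkage, after fitting model1 (get_best_iteration and tree_count_ are standard CatBoost accessors):

print('best iteration:', model.get_best_iteration())
print('trees kept:    ', model.tree_count_)  # the shrunk model typically keeps best_iteration + 1 trees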
However, when I compare feature importances, they look drastically different.
Feature Score_m1 Score_m2 delta
0 x0 3.612309 2.013193 -1.399116
1 x1 3.390630 3.121273 -0.269357
2 x2 2.762750 1.822564 -0.940186
3 x3 2.553052 NaN NaN
4 x4 2.400786 0.329625 -2.071161
As you can see, feature x3, which was in the top 3 in the first model, dropped out entirely in the second model. Not only that, there is a large shift in scores between the models for a given feature. About 60 features present in model1 are absent from model2, and about 60 features present in model2 are absent from model1. delta is Score_m2 minus Score_m1. I have seen scores change a little between models, but never this drastically. AUC and LogLoss barely change whether I use model1 or model2.
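For reference, the comparison above was built roughly like this (a sketch; model1/model2 stand for the two fitted classifiers, feature_names for the column names, and I keep only features with nonzero importance, which is what produces the NaNs):

import pandas as pd

fi1 = pd.DataFrame({'Feature': feature_names,
                    'Score_m1': model1.get_feature_importance()})
fi2 = pd.DataFrame({'Feature': feature_names,
                    'Score_m2': model2.get_feature_importance()})

# Keep features each model actually used; an outer merge leaves NaN where a
# feature shows up in one model but not the other.
comparison = (fi1[fi1.Score_m1 > 0]
              .merge(fi2[fi2.Score_m2 > 0], on='Feature', how='outer'))
comparison['delta'] = comparison['Score_m2'] - comparison['Score_m1']
print(comparison.sort_values('Score_m1', ascending=False).head())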
Now I have the following questions about this situation:
1. Are these models unstable because of a small number of samples and a large number of features? If so, how can I check for this?
2. Are some features simply not giving much information about the outcome, so that whether they get used for a split is largely random? If so, how can I check for this?
3. Is CatBoost the right model for this situation?
Any help regarding this issue will be appreciated.
Upvotes: 1
Views: 1928
Yes. Trees in general are somewhat unstable. If you remove the least important feature, you can get a very different model.
Having more data reduces this tendency.
Having more features increases this tendency.
Tree algorithms are also randomized by nature, so results will differ from run to run.
Things to try:
- Run the model a large number of times with different random seeds, and use the spread of the results to determine which features are consistently least important (how many features do you have?). A sketch of this check follows the list.
- Try to balance your training set. This might require you to upsample the rarer cases.
- Get more data. Maybe you'll have to combine your train and test sets into one training set and use the hold-out as the test set.
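A minimal sketch of that seed-stability check, reusing the parameters and variable names from the question (the number of repeats is arbitrary):

import pandas as pd
from catboost import CatBoostClassifier

importances = []
for seed in range(20):  # more repeats give a better estimate
    m = CatBoostClassifier(depth=9, l2_leaf_reg=1, iterations=300,
                           learning_rate=0.05, loss_function='Logloss',
                           random_seed=seed, logging_level='Silent')
    m.fit(train_set.values, Y_train.values,
          cat_features=categorical_features_indices,
          eval_set=(test_set.values, Y_test))
    importances.append(m.get_feature_importance())

# Features whose importance swings wildly across seeds carry little stable signal.
imp = pd.DataFrame(importances, columns=train_set.columns)
stability = imp.agg(['mean', 'std']).T.sort_values('std', ascending=False)
print(stability.head(10))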
Upvotes: 1