Reputation: 728
I'm currently using CatBoostRegressor(iterations=500, random_seed=123, cat_features=['month_number', 'day_of_week', 'year']) to develop a one-year predictive model at a daily level. The predictor variables are a time feature (date), categorical features A1-A4 (for example, A1 = Group: 1/2/3/4, A2 = Class: Upper/middle/lower), and numerical features A5-A7 (for example, A5 = Price: 100/125.6/132.7, A6 = Age: 25/30/31/57). The response variable is continuous, such as revenue or rental price.
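For reference, a minimal, self-contained sketch of my setup (the column names and values below are made up to mirror the structure; my real data also has A3, A4, A7 and a proper daily date):

```python
import pandas as pd
from catboost import CatBoostRegressor

# Made-up rows mirroring the structure above; the real data is one row per day.
df = pd.DataFrame({
    'month_number': [1, 1, 2, 2],
    'day_of_week':  [0, 1, 2, 3],
    'year':         [2023, 2023, 2023, 2023],
    'A1': ['1', '2', '3', '4'],                    # Group
    'A2': ['Upper', 'middle', 'lower', 'middle'],  # Class
    'A5': [100.0, 125.6, 132.7, 118.4],            # Price
    'A6': [25, 30, 31, 57],                        # Age
    'revenue': [1050.0, 980.5, 1130.2, 1001.7],    # response variable
})

# A1 and A2 are declared categorical here as well, on top of the time features.
cat_cols = ['month_number', 'day_of_week', 'year', 'A1', 'A2']
X, y = df.drop(columns='revenue'), df['revenue']

model = CatBoostRegressor(iterations=500, random_seed=123,
                          loss_function='MAE', eval_metric='MAE',
                          cat_features=cat_cols, verbose=False)
model.fit(X, y)
```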
My current model, CatBoostRegressor(iterations=500, random_seed=123) with the two additional parameters loss_function='MAE' and eval_metric='MAE', mostly underfits the actuals, although the final MAPE was around 10%-20% (so some overfitting offsets the underfitting). However, I am unsure how the MAE function described in the CatBoost documentation defines the weight w_i for each prediction error, given that the MAE is a weighted average of all the prediction errors. In particular, what do they mean by "Use object/group weights to calculate metrics if the specified value is true"? I am unable to find any examples of how these weights are set, because if the weights were all equal, the model would probably not underfit?
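To make my reading of the docs concrete, here is a minimal sketch of the weighted MAE as I understand it, together with how per-object weights appear to be passed via the Pool's weight argument (the data below is made up; when weight is omitted I assume every object gets w_i = 1):

```python
import numpy as np
from catboost import CatBoostRegressor, Pool

# Weighted MAE as I read the docs: sum(w_i * |y_i - yhat_i|) / sum(w_i).
# With the presumed default w_i = 1 this is just the plain mean absolute error.
def weighted_mae(y_true, y_pred, w=None):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    w = np.ones_like(y_true) if w is None else np.asarray(w, dtype=float)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

# Made-up data; per-object weights go in through Pool's weight argument.
rng = np.random.default_rng(123)
X = rng.random((200, 3))
y = rng.random(200)
w = np.ones(200)  # equal weights, which I assume is the default when omitted
train_pool = Pool(X, label=y, weight=w)

model = CatBoostRegressor(iterations=100, loss_function='MAE', eval_metric='MAE',
                          random_seed=123, verbose=False)
model.fit(train_pool)
print(weighted_mae(y, model.predict(train_pool)))
```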
What I have tried so far to address the underfitting: I already use almost the entire training dataset to train the CatBoostRegressor(), so increasing the training size is not a feasible option for me (the model needs to train fast enough in a production environment). The most promising remaining option is to increase the number of iterations from 500 to 1000; I tried this, but it did not improve the MAPE.
Question. Is it true that the default parameters of CatBoostRegressor() change dynamically based on the dataset? What are the recommended values to tune for the learning rate and the number of trees? Should I also customize the loss_function and eval_metric to cope with the underfitting issue? The tutorial provided for writing a custom loss function is quite complicated and I couldn't follow it, so any other tangible examples would be appreciated.
Upvotes: 1
Views: 294
Reputation: 91
I don't know if they have changed the defaults since you posted, but as of July 2024, the default number of trees (iterations) is 1000 (official docs). Also, the learning rate is dynamically adjusted based on the dataset and the number of iterations, and its value is close to the optimal one (official docs). Therefore, if you are sure that the model still underfits, try increasing the number of iterations further (e.g. 1100, 1250, 1500, ...) or decreasing the learning rate (to a value lower than the auto-adjusted one). However, note that increasing iterations or decreasing the learning rate will both increase overall training time, so you may need to re-assess whether increasing the training size is better for production (and also consider other underfitting remedies such as adding more features or feature engineering).
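As a minimal sketch of overriding both knobs explicitly (the values are illustrative, not tuned recommendations, and the data is synthetic):

```python
import numpy as np
from catboost import CatBoostRegressor

# Synthetic stand-in data; substitute your real features and target.
rng = np.random.default_rng(123)
X = rng.random((500, 7))
y = rng.random(500)

model = CatBoostRegressor(
    iterations=1500,      # above the current default of 1000
    learning_rate=0.03,   # illustrative; pick something below the auto-tuned value
    loss_function='MAE',
    eval_metric='MAE',
    random_seed=123,
    verbose=False,
)
model.fit(X, y)
# If learning_rate is omitted entirely, CatBoost auto-adjusts it based on the
# dataset and the number of iterations.
```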
Regarding the loss_function and eval_metric, you can now use MAPE directly (official docs) to get more reliable results: MAPE and MAE are related but NOT proportional, so optimizing for MAE gives no guarantee that MAPE is also optimized.
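For example, a minimal sketch with MAPE as both the objective and the evaluation metric (synthetic data; note that MAPE divides by the target, so it assumes targets away from zero):

```python
import numpy as np
from catboost import CatBoostRegressor

# Synthetic stand-in data, shifted so the targets are strictly positive.
rng = np.random.default_rng(0)
X = rng.random((500, 7))
y = rng.random(500) + 1.0

model = CatBoostRegressor(
    iterations=1000,
    loss_function='MAPE',   # optimize MAPE directly instead of MAE
    eval_metric='MAPE',
    random_seed=123,
    verbose=False,
)
model.fit(X, y)
print(model.get_best_score())  # training-set MAPE when no eval_set is given
```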
Upvotes: 1