user177196

Reputation: 728

Tuning underfit catboost regressor model with fixed training set

I'm currently using CatBoostRegressor(iterations=500, random_seed=123, cat_features=['month_number', 'day_of_week', 'year']) to develop a 1-year predictive model at a daily level. The predictors are a time feature (date), categorical features A1-A4 (for example, A1 = Group: 1/2/3/4, A2 = Class: upper/middle/lower), and numerical features A5-A7 (for example, A5 = Price: 100/125.6/132.7, A6 = Age: 25/30/31/57). The response is a continuous variable, such as revenue or rental price.

My current model, CatBoostRegressor(iterations=500, random_seed=123) with the two parameters loss_function='MAE' and eval_metric='MAE', mostly underfits the actuals, although the final MAPE is around 10%-20% (so some overfitting offsets the underfitting). However, I am unsure how the MAE function described in the CatBoost documentation defines the weight w_i for each prediction error, since the MAE is a weighted average of all the prediction errors. In particular, what do they mean by "Use object/group weights to calculate metrics if the specified value is true" (https://stats.stackexchange.com/questions/ask)? I cannot find any examples of how these weights are set, because if the weights were all equal, the model would probably not underfit?
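For context, my understanding of the weighted MAE formula in the docs is that w_i defaults to 1 for every object unless explicit per-object weights are supplied (e.g. via the weight argument of catboost.Pool). A minimal NumPy sketch of that formula (the weighted_mae helper is illustrative, not a CatBoost function):

```python
import numpy as np

def weighted_mae(y_true, y_pred, w=None):
    """Weighted MAE: sum(w_i * |y_i - yhat_i|) / sum(w_i).
    With w=None all weights are 1, which reduces to the plain MAE."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    w = np.ones_like(y_true) if w is None else np.asarray(w, float)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

y_true = [100.0, 120.0, 130.0]
y_pred = [110.0, 115.0, 130.0]

# Equal weights: just the plain mean absolute error, (10 + 5 + 0) / 3.
print(weighted_mae(y_true, y_pred))               # 5.0
# Up-weighting the first object shifts the metric toward its error:
# (3*10 + 1*5 + 1*0) / (3 + 1 + 1).
print(weighted_mae(y_true, y_pred, w=[3, 1, 1]))  # 7.0
```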

What I have tried so far to address the underfitting: I already use almost the entire training dataset to train CatBoostRegressor(), and increasing the training size further is not feasible for me (the model needs to train fast enough in a production environment). The most promising remaining option seemed to be increasing the number of iterations from 500 to 1000; I tried this, but it did not improve the MAPE.

Question. Is it true that the default parameters of CatBoostRegressor() change dynamically based on the dataset? What are the recommended values to tune for the learning rate and the number of trees? Also, how can I customize loss_function and eval_metric to cope with the underfitting? The tutorial on writing a custom loss function is complicated enough that I couldn't follow it, so any tangible examples would be appreciated.

Upvotes: 1

Views: 294

Answers (1)

Loc Quan

Reputation: 91

I don't know if they have changed the defaults, but as of July 2024, the default number of trees (iterations) is 1000 (official docs). Also, the learning rate is adjusted dynamically based on the dataset and the number of iterations, and its auto-chosen value is close to the optimal one (official docs). Therefore, if you are sure that the model still underfits, try increasing the number of iterations (e.g. 1100, 1250, 1500, ...) or decreasing the learning rate (to a value lower than the auto-adjusted one). However, note that increasing the iterations or decreasing the learning rate will both increase overall training time, so you may need to re-assess whether increasing the training size is better for production (and also consider other ways to reduce underfitting, such as adding more features or feature engineering).

Regarding loss_function and eval_metric, you can now use MAPE directly (official docs) to get more reliable results: MAPE and MAE are related but NOT proportional, so optimizing MAE gives no guarantee that MAPE is also optimized.
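A quick way to see that the two metrics can disagree (a pure-NumPy illustration, not CatBoost-specific): because MAPE scales each error by its target, a small absolute error on a small target can cost more MAPE than a large absolute error on a large target.

```python
import numpy as np

def mae(y, p):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(p))))

def mape(y, p):
    y, p = np.asarray(y, float), np.asarray(p, float)
    return float(np.mean(np.abs((y - p) / y)))

y = np.array([10.0, 1000.0])
pred_a = np.array([12.0, 1000.0])  # small absolute error, but on the small target
pred_b = np.array([10.0, 990.0])   # larger absolute error, on the large target

print(mae(y, pred_a), mape(y, pred_a))  # 1.0 0.1    -> better MAE, worse MAPE
print(mae(y, pred_b), mape(y, pred_b))  # 5.0 0.005  -> worse MAE, better MAPE
```

So pred_a wins on MAE while pred_b wins on MAPE, which is why optimizing the metric you actually report (MAPE here) tends to give more consistent results.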

Upvotes: 1
