user3476463

Reputation: 4575

Determine the minimal number of features for XGBoost regression using AIC

I am trying to find the minimum number of features to include in my xgboost regression model, to avoid overfitting. I'm doing this by fitting my xgboost model to my data (X_train), predicting, and then calculating AIC (as shown in the Python code below). Then I drop one feature from the X_train pandas dataframe, fit, predict, and calculate AIC again. What I'm noticing is that as I drop features from X_train, my AIC actually goes up. Is my approach correct? Is my AIC calculation correct? If the answer to both questions is yes, what might explain the AIC increasing as I drop features from the data the model is trained on? The code I'm using to calculate AIC is below.

code:

import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import numpy as np

# Assuming X_train and y_train are pandas DataFrames
# Convert them to numpy arrays (optional: XGBoost's scikit-learn API also accepts pandas objects)
X_train_np = X_train.values
y_train_np = y_train.values.flatten()  # Assuming y_train is a single column DataFrame, convert it to a 1D array

# Fit the XGBoost regression model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror')
xg_reg.fit(X_train_np, y_train_np)

# Predict on the training data
y_pred = xg_reg.predict(X_train_np)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_train_np, y_pred)

# Number of parameters in the model
num_params = len(xg_reg.get_booster().get_dump()) + 1  # Adding 1 for the intercept

# Calculate the Akaike Information Criterion (AIC)
n = len(y_train_np)
aic = n * np.log(mse) + 2 * num_params

print("AIC:", aic)

Upvotes: 0

Views: 329

Answers (1)

Nick ODell

Reputation: 25190

What I'm wondering is if my approach is correct?

I think what most data-science practitioners use in this situation is cross-validated error on whatever metric they're interested in.

This is partly for practical reasons (tree-based methods can memorize every training point, so in-sample fit says little) and partly for theoretical reasons (it is unclear what counts as a parameter in a tree ensemble).

If my AIC calculation is correct?

I don't think so. If you want to calculate a log-likelihood from the MSE, assuming your error term is Gaussian, you need to know the standard deviation (sigma) of that error term to find the probability of the data under the model, and XGBoost doesn't give that to you. See here for an example.
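To make the role of sigma concrete, here is a small sketch of the Gaussian log-likelihood, assuming i.i.d. normal errors. The function names and the numbers are made up for illustration. Sigma is an explicit input; one common convention (an assumption, not something XGBoost provides) is to plug in the in-sample estimate sqrt(MSE), in which case the AIC reduces to the question's n * log(MSE) + 2k plus a constant that depends only on n.

```python
import numpy as np


def gaussian_log_likelihood(y_true, y_pred, sigma):
    """Log-likelihood of the residuals under i.i.d. N(0, sigma^2) errors."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    n = residuals.size
    return (
        -0.5 * n * np.log(2.0 * np.pi * sigma**2)
        - np.sum(residuals**2) / (2.0 * sigma**2)
    )


def aic(log_likelihood, num_params):
    """Textbook definition: AIC = 2k - 2 * log-likelihood."""
    return 2.0 * num_params - 2.0 * log_likelihood


# Made-up numbers, purely for illustration.
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5, 10.0])
k = 2  # hypothetical parameter count

n = y_true.size
mse = np.mean((y_true - y_pred) ** 2)

# Assumption: sigma is estimated from the same residuals (sigma^2 = MSE).
aic_full = aic(gaussian_log_likelihood(y_true, y_pred, np.sqrt(mse)), k)

# The shortcut from the question drops a constant that depends only on n.
aic_shortcut = n * np.log(mse) + 2 * k
constant = n * (1.0 + np.log(2.0 * np.pi))

print(aic_full, aic_shortcut + constant)
```

The two printed values agree, so the shortcut is only as trustworthy as the in-sample estimate of sigma. When the model can drive the training residuals towards zero, that estimate collapses along with the MSE.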

Another weird aspect of this metric is how decision trees can memorize particular training examples.

In the following program, I've set eta=1, the maximum learning rate. This is normally a recipe for overfitting.

However, it results in extremely accurate in-sample prediction, and therefore a very high in-sample log-likelihood and a very low AIC.

Here's an example. You can try this with eta=1 and without.

import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn import datasets
import numpy as np

# Load an example regression dataset
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Fit the XGBoost regression model (eta=1 is the maximum learning rate)
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', eta=1, n_estimators=100)
xg_reg.fit(X, y)

# Predict on the training data
y_pred = xg_reg.predict(X)

# Calculate the in-sample mean squared error (MSE)
mse = mean_squared_error(y, y_pred)
print("mse:", mse)

# "Number of parameters" as computed in the question: number of trees + 1
num_params = len(xg_reg.get_booster().get_dump()) + 1

# Calculate the Akaike Information Criterion (AIC)
n = len(y)
aic = n * np.log(mse) + 2 * num_params

print("AIC:", aic)
print("params:", num_params)

The calculation of num_params is also strange. It counts the number of trees. Each of those trees contains many decision nodes, each with a threshold that determines whether a sample goes left or right at that node. So this is probably an under-count of the parameters.

However, I don't think there is a clear theoretical answer on the right way to do this. Here are some different ideas on how to calculate this number:

Upvotes: 1
