Reputation: 152
I don't understand why the metrics are not the same between xgb.train and xgb.XGBRegressor. I do get the same prediction values. Do you have an idea?
Below is a small example on simulated data.
Import libraries
import numpy as np
import pandas as pd
import xgboost as xgb
import plotly.express as px
import json
The simulated data
n = 1000
braking = np.random.normal(10, 2, n)
acceleration = np.random.normal(8, 1.5, n)
phone = np.random.normal(1, 0.5, n)
distance = np.random.normal(50, 50, n)
simdf = pd.DataFrame({
"braking": braking,
"acceleration": acceleration,
"phone": phone,
"distance": distance
})
simdf['distance'] = np.where(simdf['distance'] < 2, 2, simdf['distance'])
simdf['phone'] = np.where(simdf['phone'] < 0, 0, simdf['phone'])
mu_A = np.exp(-1 + 0.02 * simdf['braking'] + 0.001 * simdf['acceleration'] + 0.0008 * simdf['distance'])
y_A = np.random.poisson(mu_A, n)
simdf['response'] = y_A
simdf['margin'] = 0.02 * simdf['braking'] + 0.001 * simdf['acceleration']
# the code below uses simdf_train, whose construction is not shown in the original post;
# as an assumption, the full simulated data is used as the training set so the example runs end to end
simdf_train = simdf.copy()
Set parameters for the native xgboost
model_param = {'objective': 'count:poisson',
'monotone_constraints': (1,1,1,1),
'n_estimators': 50,
'seed': 12345,
'eval_metric': 'poisson-nloglik',
}
Compute the native xgboost
xgbMatrix_A = xgb.DMatrix(simdf_train[["braking","acceleration","phone","distance"]],
label=simdf_train[["response"]])
xgbMatrix_A.set_info(base_margin=np.log(simdf_train[["margin"]]))
bst_A = xgb.train(model_param,
xgbMatrix_A,
num_boost_round=50,
evals = [(xgbMatrix_A,"train")]
)
bst_A
sim_df_pred = xgb.DMatrix(simdf[["braking","acceleration","phone","distance"]])
sim_df_pred.set_info(base_margin=np.log(simdf[["margin"]]))
predictions = bst_A.predict(sim_df_pred)
simdf['pred_python'] = predictions
Get parameters of the native xgboost and update them
config = json.loads(bst_A.save_config())
model_param = config['learner']['gradient_booster']['updater']['grow_colmaker']['train_param']
model_param.update({'objective': 'count:poisson',
'n_estimators': 50,
'eval_metric': 'poisson-nloglik'
})
Compute the XGBRegressor model
bst_B = xgb.XGBRegressor(**model_param)
bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],simdf_train[["response"]], base_margin=np.log(simdf_train[["margin"]]),
eval_set=[(simdf_train[["braking","acceleration","phone","distance"]], simdf_train[["response"]])])
predictions = bst_B.predict(simdf[["braking","acceleration","phone","distance"]], base_margin=np.log(simdf[["margin"]]))
simdf['pred_python_sk_log'] = predictions
Compare the predictions and the metrics
## prediction comparison
np.sum(simdf["pred_python_sk_log"] - simdf["pred_python"])
## metric comparison
print(bst_A.eval(xgbMatrix_A))
print(bst_B.evals_result()["validation_0"]["poisson-nloglik"][-1])
The predictions are the same (the sum of the differences is 0), but the metrics are not.
Upvotes: 1
Views: 95
Reputation: 152
After some research I finally found what I had to add to the code to obtain the same value for the metric: the parameter base_margin_eval_set. bst_A.eval(xgbMatrix_A) evaluates a DMatrix that already carries the base margin, while the DMatrix the sklearn wrapper builds for eval_set has no base margin unless base_margin_eval_set is passed, hence the different poisson-nloglik values.
In the step "Compute the XGBRegressor model", replace the bst_B.fit(...) call by:
bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],simdf_train[["response"]],
base_margin_eval_set= [np.log(simdf_train[["margin"]])],
base_margin=np.log(simdf_train[["margin"]]),
eval_set=[(simdf_train[["braking","acceleration","phone","distance"]], simdf_train[["response"]])])
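With this change, re-running the metric comparison from the question should print the same poisson-nloglik value for both models (a minimal check, reusing the objects defined above):
## metric comparison after the fix
print(bst_A.eval(xgbMatrix_A))
print(bst_B.evals_result()["validation_0"]["poisson-nloglik"][-1])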
Upvotes: 1