Lolivano

Reputation: 152

Native xgb and XGBRegressor same predictions but not the same metric

I don't understand why the metrics are not the same between xgb.train and xgb.XGBRegressor, even though I get the same prediction values. Do you have an idea?

Below is a small example on simulated data.

Import the libraries

import numpy as np
import pandas as pd
import xgboost as xgb
import plotly.express as px
import json

The simulated data

n = 1000

braking = np.random.normal(10, 2, n)
acceleration = np.random.normal(8, 1.5, n)
phone = np.random.normal(1, 0.5, n)
distance = np.random.normal(50, 50, n)

simdf = pd.DataFrame({
    "braking": braking,
    "acceleration": acceleration,
    "phone": phone,
    "distance": distance
})
simdf['distance'] = np.where(simdf['distance'] < 2, 2, simdf['distance'])
simdf['phone'] = np.where(simdf['phone'] < 0, 0, simdf['phone'])
mu_A = np.exp(-1 + 0.02 * simdf['braking'] + 0.001 * simdf['acceleration'] + 0.0008 * simdf['distance'])
y_A = np.random.poisson(mu_A, n)
simdf['response'] = y_A
simdf['margin'] = 0.02 * simdf['braking'] + 0.001 * simdf['acceleration']

Set parameters for the native xgboost

model_param = {'objective': 'count:poisson', 
               'monotone_constraints': (1,1,1,1), 
               'n_estimators': 50, 
               'seed': 12345,
               'eval_metric': 'poisson-nloglik',
               }

Compute the native xgboost

simdf_train = simdf  # simdf_train is used below but was never defined; assuming the full simulated frame is the training set
xgbMatrix_A = xgb.DMatrix(simdf_train[["braking","acceleration","phone","distance"]], 
                          label=simdf_train[["response"]])

xgbMatrix_A.set_info(base_margin=np.log(simdf_train[["margin"]]))
bst_A = xgb.train(model_param,
    xgbMatrix_A,
    num_boost_round=50,
    evals = [(xgbMatrix_A,"train")]
)
bst_A
sim_df_pred = xgb.DMatrix(simdf[["braking","acceleration","phone","distance"]])
sim_df_pred.set_info(base_margin=np.log(simdf[["margin"]]))
predictions = bst_A.predict(sim_df_pred)
simdf['pred_python'] = predictions

Get parameters of the native xgboost and update them

config = json.loads(bst_A.save_config())
model_param = config['learner']['gradient_booster']['updater']['grow_colmaker']['train_param']
model_param.update({'objective': 'count:poisson',
               'n_estimators': 50,
               'eval_metric': 'poisson-nloglik'
               })

Compute the XGBRegressor model

bst_B = xgb.XGBRegressor(**model_param)
bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],simdf_train[["response"]], base_margin=np.log(simdf_train[["margin"]]),
          eval_set=[(simdf_train[["braking","acceleration","phone","distance"]], simdf_train[["response"]])])
predictions = bst_B.predict(simdf[["braking","acceleration","phone","distance"]], base_margin=np.log(simdf[["margin"]]))
simdf['pred_python_sk_log'] = predictions

Compare the predictions and the metrics

## prediction comparison 
np.sum(simdf["pred_python_sk_log"] - simdf["pred_python"])

## metric comparison 
print(bst_A.eval(xgbMatrix_A))
print(bst_B.evals_result()["validation_0"]["poisson-nloglik"][-1])

Same predictions (the sum of the differences is 0), but the metrics are not the same.
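A likely explanation (not confirmed in the question itself): in the posted code, eval_set is passed to bst_B.fit without a matching base margin, so the evaluation predictions omit the per-row offset, while predict is called with base_margin and includes it. Same trees, different predictions at eval time, hence a different poisson-nloglik. The sketch below uses plain NumPy and made-up numbers to reproduce XGBoost's poisson-nloglik formula (mean of pred - y*log(pred) + log(y!)) and show that dropping a per-row offset from otherwise identical predictions changes the metric:

```python
import math
import numpy as np

def poisson_nloglik(y, pred):
    # XGBoost's poisson-nloglik: mean of pred - y*log(pred) + log(y!)
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    log_fact = np.array([math.lgamma(v + 1.0) for v in y])
    return float(np.mean(pred - y * np.log(pred) + log_fact))

rng = np.random.default_rng(0)
margin = rng.uniform(0.5, 1.5, 500)   # hypothetical per-row offset (like the question's margin)
mu = 0.3 * margin                     # predictions that include the offset
y = rng.poisson(mu)

with_offset = poisson_nloglik(y, mu)                    # eval sees base_margin
without_offset = poisson_nloglik(y, np.full(500, 0.3))  # eval ignores the offset

print(with_offset, without_offset)  # different metric values for the same labels
```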

Upvotes: 1

Views: 95

Answers (1)

Lolivano
Lolivano

Reputation: 152

After some research I finally found what has to be added to the code to obtain the same value for the metric: the parameter base_margin_eval_set.

In the step "Compute the XGBRegressor model", replace the bst_B.fit(...) call with:

bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],simdf_train[["response"]], 
          base_margin_eval_set= [np.log(simdf_train[["margin"]])],
          base_margin=np.log(simdf_train[["margin"]]), 
          eval_set=[(simdf_train[["braking","acceleration","phone","distance"]], simdf_train[["response"]])])

Upvotes: 1
