L Xandor

Reputation: 1841

XGBoost giving slightly different predictions for list vs array, which is correct?

I noticed I was passing a double-bracketed (nested) list of test feature values:

print(test_feats)
>> [[23.0, 3.0, 35.0, 0.28, -3.0, 18.0, 0.0, 0.0, 0.0, 3.33, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 39.0, 36.0, 113.0, 76.0, 0.0, 0.0, 1.0, 0.34, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, 0.0, 25.0, 48.0, 48.0, 0.0, 29.0, 52.0, 53.0, 99.0, 368.0, 676.0, 691.0, 4.0, 9.0, 12.0, 13.0]]

I noticed that when I pass this to XGBoost for prediction, it returns a different result than when I convert it to an array:

array_test_feats = np.array(test_feats)
print(regr.predict_proba(test_feats)[:,1][0])
print(regr.predict_proba(array_test_feats)[:,1][0])
>> 0.46929297
>> 0.5161868

Some basic checks suggest the values are the same:

print(sum(test_feats[0]) == array_test_feats.sum())
print(test_feats == array_test_feats)
>> True
>> array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True]])

I am guessing the array is the way to go, but I don't know how to tell for sure. The predictions are close enough that this could easily slip by unnoticed, so I would really like to understand why it is happening.

Upvotes: 1

Views: 864

Answers (1)

Ammar Askar

Reputation: 535

You've just run into the issue described here: https://github.com/dmlc/xgboost/pull/3970

The documentation does not list Python lists as an allowed input type for the data passed into DMatrix. Despite this, a list can be passed in without raising an error. That change would prevent a list from being passed in directly.

I experienced an issue where passing in a list vs. a np.array resulted in different predictions (sometimes over 10% relative difference) for the same data. Although these differences were infrequent (~1.5% of cases tested), in certain applications they could cause serious problems.

Essentially, what's happening under the hood is that passing Python lists directly is not officially supported by XGBoost, but it happens to work anyway because it hits a fall-through case in XGBoost's data conversion.

This causes XGBoost to use the XGDMatrixCreateFromCSREx function instead of XGDMatrixCreateFromMat to create the underlying matrix for the data. There is then a difference in how missing elements are treated between the sparse and dense representations:

"Sparse" elements are treated as "missing" by the tree booster and as zeros by the linear booster.

Upvotes: 1
