Reputation: 1841
I noticed I was passing a double-bracketed (nested) list of test feature values to my model:
print(test_feats)
>> [[23.0, 3.0, 35.0, 0.28, -3.0, 18.0, 0.0, 0.0, 0.0, 3.33, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 39.0, 36.0, 113.0, 76.0, 0.0, 0.0, 1.0, 0.34, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, -999.0, 0.0, 25.0, 48.0, 48.0, 0.0, 29.0, 52.0, 53.0, 99.0, 368.0, 676.0, 691.0, 4.0, 9.0, 12.0, 13.0]]
When I pass this list to XGBoost for prediction, it returns a different result than when I first convert it to an array:
import numpy as np
array_test_feats = np.array(test_feats)
print(regr.predict_proba(test_feats)[:,1][0])
print(regr.predict_proba(array_test_feats)[:,1][0])
>> 0.46929297
>> 0.5161868
Some basic checks suggest the values are the same:
print(sum(test_feats[0]) == array_test_feats.sum())
print(test_feats == array_test_feats)
>> True
>> array([[ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True]])
I am guessing the array is the way to go, but I really don't know how to tell. The predictions are close enough that the discrepancy could easily slip by, so I would really like to understand why this is happening.
Upvotes: 1
Views: 864
Reputation: 535
You've just run into the issue described here: https://github.com/dmlc/xgboost/pull/3970
The documentation does not include lists as an allowed type for the data inputted into DMatrix. Despite this, a list can be passed in without an error. This change would prevent a list from being passed in directly.
I experienced an issue where passing in a list vs a np.array resulted in different predictions (sometimes over 10% relative difference) for the same data. Though these differences were infrequent (~1.5% of cases tested), in certain applications this could cause serious issues.
Essentially, what's happening under the hood is that passing Python lists directly is not officially supported in XGBoost, but it happens to work anyway because it hits a fall-through case in XGBoost's data conversion.
This causes XGBoost to use the XGDMatrixCreateFromCSREx function instead of XGDMatrixCreateFromMat to create the underlying matrix for the data. There is then a difference in how missing elements behave in the sparse vs. dense representations:
"Sparse" elements are treated as "missing" by the tree booster and as zeros by the linear booster.
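To make the sparse vs. dense difference concrete, here is a toy sketch in plain Python. It is not XGBoost's actual conversion code. The helper names `to_csr` and `csr_lookup` and the shortened feature row are hypothetical, and the sketch assumes the sparse path stores only nonzero entries, as a CSR encoding normally does. Under that assumption, a 0.0 in the original list has no stored entry, so a consumer that treats absent entries as missing cannot tell it apart from a real gap:

```python
from typing import List, Optional, Tuple

CSR = Tuple[List[float], List[int], List[int]]


def to_csr(rows: List[List[float]]) -> CSR:
    """Encode dense rows as CSR (data, indices, indptr), keeping only nonzero entries."""
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0.0:  # zeros are simply not stored
                data.append(value)
                indices.append(col)
        indptr.append(len(data))  # end offset of this row in `data`
    return data, indices, indptr


def csr_lookup(csr: CSR, row: int, col: int) -> Optional[float]:
    """Return the stored value, or None when the entry is absent (i.e. "missing")."""
    data, indices, indptr = csr
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return None


# Hypothetical, shortened feature row (the real one in the question has 53 values).
rows = [[23.0, 0.0, 0.28, -999.0]]
csr = to_csr(rows)

dense_value = rows[0][1]              # the dense view keeps the explicit zero
sparse_value = csr_lookup(csr, 0, 1)  # the sparse view has no entry for it

print(dense_value)
print(sparse_value)
```

The row in the question contains many 0.0 entries, which is where the two code paths would be expected to diverge. Converting the list with np.array(test_feats) before calling predict_proba, as the question already does, keeps the input on the dense path.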
Upvotes: 1