Reputation: 12515
I've built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.
When I went to examine an individual estimator, however, I had difficulty interpreting the binary input features, such as f60150 below. The real-valued f60150 in the bottommost chart is easy to interpret - its criterion is in the expected range of that feature. However, the comparison being made for the binary features, <X> < -9.53674e-07, doesn't make sense. Each of these features is either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underpinning plotting libraries, but it doesn't make sense to use that comparison when the feature is never negative. Can someone help me understand which direction (i.e. yes, missing vs. no) corresponds to which true/false side of these binary feature nodes?
Here is a reproducible example:
import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt
def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
### Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target
### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)
# Whether to "booleanize" the input matrix
booleanize = True
# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True
if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)
# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)
# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
Running the above with booleanize and to_int set to True yields the following chart:

[tree plot omitted]

Running the above with booleanize and to_int set to False yields the following chart:

[tree plot omitted]
Heck, even if I do a really simple example, I get the "right" results, regardless of whether X or y are integer or floating types.
X = np.matrix(
    [
        [1,0],
        [1,0],
        [0,1],
        [0,1],
        [1,1],
        [1,0],
        [0,0],
        [0,0],
        [1,1],
        [0,1]
    ]
)
y = np.array([1,0,0,0,1,1,1,0,1,1])
model = XGBClassifier(random_state=100)
model.fit(X, y)
plot_tree(model, rankdir='LR'); plt.show()
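For reference, the raw split conditions can also be inspected as text rather than through the plot; a minimal sketch, assuming the model above has already been fit:

# Dump the boosted trees as text; each node line includes yes=, no=
# and missing= child ids, which states the branch routing directly.
for i, tree_text in enumerate(model.get_booster().get_dump()[:2]):
    print(f'--- tree {i} ---')
    print(tree_text)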
Upvotes: 25
Views: 2310
Reputation: 1
The issue you're encountering with the binary input features, specifically comparisons such as f60150 < -9.53674e-07, comes from two things: how XGBoost stores split thresholds, and how it treats the zero entries of a sparse matrix.
In XGBoost, all features are treated as continuous values, even if they are binary, and split thresholds are stored as 32-bit floats. For a binary feature the learned threshold lands just below zero: -9.53674e-07 is a tiny negative number that effectively means "less than zero", not a meaningful cut point inside your data.
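As a quick sanity check (a minimal sketch, using only plain Python arithmetic, not anything read back from the model), the plotted threshold appears to be -2**-20 rounded to six significant figures, and neither binary value falls below it:

# The plotted threshold is consistent with -2**-20.
threshold = -2 ** -20
print(f'{threshold:.6g}')            # -9.53674e-07
# Neither binary value is below the threshold:
print(0 < threshold, 1 < threshold)  # False False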
For binary features (0 or 1), the first thing to notice is that the comparison is false for both possible values:
f60150 < -9.53674e-07: neither 0 nor 1 is less than a negative number, so no stored value ever satisfies this test.
The split is still useful because your input is a scipy.sparse.csr_matrix: zero entries are not stored at all, and XGBoost treats unstored entries as missing. Missing values follow the node's learned default direction, which plot_tree shows by labelling that edge yes, missing.
Thus, for binary features coming from a sparse matrix:
A node like f60150 < -9.53674e-07 sends rows where f60150 is 0 (absent from the matrix, hence missing) down the yes, missing branch, and rows where it is 1 down the no branch.
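You can verify the branch directions without reading them off the plot; a sketch, assuming an xgboost version that provides Booster.trees_to_dataframe:

# Each row of the frame is one tree node; the 'Missing' column names
# the child that missing (i.e. unstored zero) values default to.
df = model.get_booster().trees_to_dataframe()
splits = df[df['Feature'] != 'Leaf']
print(splits[['Tree', 'Node', 'Feature', 'Split', 'Yes', 'No', 'Missing']].head())
# Wherever Missing == Yes, an explicitly stored 0 would fail the test and
# go to 'No', but an unstored 0 is missing and goes to 'Yes' --
# hence the 'yes, missing' label in the plot.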
Here is a summarized version of your example code and the logic behind it:
import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt
def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
# Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target
# Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)
# Convert input matrix to binary (0/1)
booleanize = True
to_int = True
if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)
# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)
# Train the model
model = XGBClassifier(random_state=100)
model.fit(X, y)
# Plot the tree
plot_tree(model, rankdir='LR')
plt.show()
This explanation should help you interpret the binary feature splits in the decision trees generated by XGBoost.
Upvotes: 0
Reputation: 1
import numpy as np
import scipy.sparse

# Convert a sparse matrix with positive integer elements to 1s
def booleanize_csr_matrix(mat):
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
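For completeness, a quick usage sketch with a small made-up matrix (purely illustrative):

# Counts greater than zero become 1s; the sparsity pattern is preserved.
mat = scipy.sparse.csr_matrix(np.array([[0, 2, 0],
                                        [3, 0, 1],
                                        [0, 0, 5]]))
print(booleanize_csr_matrix(mat).toarray())
# [[0. 1. 0.]
#  [1. 0. 1.]
#  [0. 0. 1.]]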
Upvotes: 0