boot-scootin

Reputation: 12515

xgboost.plot_tree: binary feature interpretation

I've built an XGBoost model and seek to examine the individual estimators. For reference, this was a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.

When I went to examine an individual estimator, however, I had difficulty interpreting the binary input features, such as f60150 below. The real-valued f60150 in the bottommost chart is easy to interpret: its criterion is in the expected range of that feature. However, the comparison being made for the binary features, <X> < -9.53674e-07, doesn't make sense. Each of these features is either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underlying plotting libraries, but it doesn't make sense to use that comparison when the feature is always non-negative. Can someone help me understand which direction (i.e., which of yes, missing vs. no) corresponds to which true/false side of these binary feature nodes?

Here is a reproducible example:

import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt

def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result

### Setup dataset
res = fetch_20newsgroups()

text = res.data
outcome = res.target

### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)

# Whether to "booleanize" the input matrix
booleanize = True

# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True

if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)

# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)

# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR'); plt.show()

Running the above with booleanize and to_int set to True yields the following chart:

[tree plot image]

Running the above with booleanize and to_int set to False yields the following chart:

[tree plot image]

Heck, even if I do a really simple example, I get the "right" results, regardless of whether X and y are integer or floating types.

X = np.array(
    [
        [1,0],
        [1,0],
        [0,1],
        [0,1],
        [1,1],
        [1,0],
        [0,0],
        [0,0],
        [1,1],
        [0,1]
    ]
)

y = np.array([1,0,0,0,1,1,1,0,1,1])

model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR'); plt.show()

[tree plot image]
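For reference, the yes/no/missing routing can also be read off as text instead of from the plot. This is a sketch that assumes the model fitted above and an xgboost version that provides trees_to_dataframe (which requires pandas):

# Inspect the fitted trees as a table; the Yes / No / Missing columns
# give the id of the child node each branch leads to.
df = model.get_booster().trees_to_dataframe()
print(df[['Tree', 'Node', 'Feature', 'Split', 'Yes', 'No', 'Missing']].head())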

Upvotes: 25

Views: 2310

Answers (2)

ABHISHEK KUMAR

Reputation: 1

The issue you're encountering with the binary input features, specifically comparisons such as f60150 < -9.53674e-07, comes from two things working together: XGBoost treats every feature as continuous, and your input is a sparse matrix.

XGBoost picks a numeric threshold for each split even when a feature only takes the values 0 and 1. In addition, a scipy.sparse.csr_matrix never stores its zero entries, and XGBoost treats unstored entries as missing. For these binary features the model therefore only ever observes the value 1, and the tiny negative threshold -9.53674e-07 is effectively a comparison against zero.

For binary features stored sparsely, the comparison reads as follows:

f60150 < -9.53674e-07: an observed value of 1 is not less than the threshold, so it takes the false (no) branch. A value of 0 is never stored in the sparse matrix, so it is routed as missing, down whichever edge the plot marks as carrying missing values; in your charts that is the edge labeled "yes, missing".

Thus, for binary features:

A node like f60150 < -9.53674e-07 sends rows where f60150 is 0 (missing in the sparse matrix) down the "yes, missing" edge, and rows where it is 1 down the "no" edge.
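You can check this routing directly on a toy model. The following is a minimal sketch, not your data; n_estimators=1 and max_depth=1 are chosen only to force a single split, and the exact threshold printed may vary by XGBoost version:

import numpy as np
import scipy.sparse
from xgboost import XGBClassifier

# One binary feature; building the CSR matrix from a dense array drops
# the zero entries, so XGBoost will treat them as missing.
X = scipy.sparse.csr_matrix(
    np.array([[1.0], [1.0], [1.0], [1.0], [1.0],
              [0.0], [0.0], [0.0], [0.0], [0.0]])
)
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

model = XGBClassifier(n_estimators=1, max_depth=1, random_state=0)
model.fit(X, y)

# The text dump prints the split condition together with the node ids
# that the yes, no, and missing branches lead to.
print(model.get_booster().get_dump()[0])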

Here is a summarized version of your example code and the logic behind it:

import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt

def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result

# Setup dataset
res = fetch_20newsgroups()
text = res.data
outcome = res.target

# Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)

# Convert input matrix to binary (0/1)
booleanize = True
to_int = True
if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)

# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)

# Train the model
model = XGBClassifier(random_state=100)
model.fit(X, y)

# Plot the tree
plot_tree(model, rankdir='LR')
plt.show()

This explanation should help you interpret the binary feature splits in the decision trees generated by XGBoost.

Upvotes: 0

Shivam Sharma

Reputation: 1

import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt

# Convert sparse matrix with positive integer elements to 1s
def booleanize_csr_matrix(mat):
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result
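
For example, applied to a count matrix from CountVectorizer (a usage sketch; the variable names are illustrative, not from the question):

# Build a count matrix, then collapse every positive count to 1.0
vec = CountVectorizer()
X_counts = vec.fit_transform(fetch_20newsgroups().data)
X_binary = booleanize_csr_matrix(X_counts)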

Upvotes: 0
