Reputation: 21233
Consider the following data:
import pandas as pd
y_train = pd.DataFrame({0: {14194: 'Fake', 13891: 'Fake', 13247: 'Fake', 11236: 'Fake', 2716: 'Real', 2705: 'Real', 16133: 'Fake', 7652: 'Real', 7725: 'Real', 16183: 'Fake'}})
X_train = pd.DataFrame({'one': {14194: 'e',
13891: 'b',
13247: 'v',
11236: 't',
2716: 'e',
2705: 'e',
16133: 'h',
7652: 's',
7725: 's',
16183: 's'},
'two': {14194: 'a',
13891: 'a',
13247: 'e',
11236: 'n',
2716: 'c',
2705: 'a',
16133: 'n',
7652: 'e',
7725: 'h',
16183: 'e'},
'three': {14194: 's',
13891: 'l',
13247: 'n',
11236: 'c',
2716: 'h',
2705: 'r',
16133: 'i',
7652: 'r',
7725: 'e',
16183: 's'},
'four': {14194: 'd',
13891: 'e',
13247: 'r',
11236: 'g',
2716: 'o',
2705: 'r',
16133: 'p',
7652: 'v',
7725: 'r',
16183: 'i'},
'five': {14194: 'f',
13891: 'b',
13247: 'o',
11236: 'b',
2716: 'i',
2705: 'i',
16133: 'i',
7652: 'i',
7725: 'b',
16183: 'i'},
'six': {14194: 'p',
13891: 's',
13247: 'l',
11236: 'l',
2716: 'n',
2705: 'n',
16133: 'n',
7652: 'l',
7725: 'e',
16183: 'u'},
'seven': {14194: 's',
13891: 's',
13247: 's',
11236: 'e',
2716: 'g',
2705: 'g',
16133: 's',
7652: 'e',
7725: 't',
16183: 'r'}})
and the following code:
from catboost import CatBoostClassifier
from catboost import Pool
cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0).fit(pool)
model.plot_tree(
tree_idx=1,
pool=pool  # "pool" is a required parameter for trees with one-hot features
)
I get the following:
But I don't understand what {five} pr_num0 tb0 type0, value>8 means. I was hoping it would look like the titanic example from the manual, which is:
import catboost
import numpy as np
from catboost import CatBoostClassifier, Pool
from catboost.datasets import titanic
titanic_df = titanic()
X = titanic_df[0].drop('Survived',axis=1)
y = titanic_df[0].Survived
is_cat = (X.dtypes != float)
for feature, feat_is_cat in is_cat.to_dict().items():
if feat_is_cat:
X[feature].fillna("NAN", inplace=True)
cat_features_index = np.where(is_cat)[0]
pool = Pool(X, y, cat_features=cat_features_index, feature_names=list(X.columns))
model = CatBoostClassifier(
max_depth=2, verbose=False, max_ctr_complexity=1, iterations=2).fit(pool)
model.plot_tree(
tree_idx=0,
pool=pool
)
This gives:
How can I get the equivalent of Sex, value = Female for my example? That would be, for example, one, value = b.
Upvotes: 4
Views: 3129
Reputation: 19322
TLDR; This is not really a visualization problem but more a question of how a feature split is done in CatBoost.
CatBoost decides which features to one-hot encode and which to ctr encode based on a parameter called one_hot_max_size. If the number of unique classes in a feature is <= one_hot_max_size, the feature is treated as one-hot. By default it is set to 2, so only binary features (0/1, or male/female) are encoded as one-hot, and the others (such as Pclass -> 1, 2, 3) are handled as ctr. Setting it high enough lets you force CatBoost to encode your columns as one-hot.
The {five} pr_num0 tb0 type0, value>8 is basically a label, value pair for a ctr split. There is no documentation for this, but after inspecting the GitHub repo, the label appears to be generated using a multi-hash. More details below.
A feature-split pair is chosen for each leaf. There are three types of splits: FloatFeature, OneHotFeature and OnlineCtr, based on the encoding that is done on the features:

1. FloatFeature: A split on a float feature is represented by the feature index and a border value:
9, border<257.23 #feature index, border value

2. OneHotFeature: A one-hot feature takes a max of n possible values (0 or 1 after encoding). The n is decided by a parameter called one_hot_max_size, which is set to 2 by default. Note that in the titanic dataset, Sex has only 2 possible values, Male or Female. If you set one_hot_max_size=4 then CatBoost one-hot encodes features with up to 4 unique classes (e.g. Pclass in titanic has 3 unique classes). A one-hot split is represented by the feature name and its value:
Sex, value=Female #feature name, value

3. OnlineCtr: If a feature has more unique classes than one_hot_max_size, CatBoost automatically uses ctr to encode it, and the split type is an OnlineCtr. It is represented by the feature name, some dummy tokens that represent the ctr components, and a value:
{five} pr_num1 tb0 type0, value>9 #Label, value
##Inspecting github, the label seems to be from a multihash
##The multihash seems to be made from (CatFeatureIdx, CtrIdx, TargetBorderIdx, PriorIdx)
##https://github.com/catboost/catboost/blob/master/catboost/libs/data/ctrs.h
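To make the label structure concrete, here is a small parser for such split descriptions. The field names below follow the (prior index, target border index, ctr type) guess above; they are not a documented CatBoost format, just a reading of the string layout:

```python
import re

# Parse an OnlineCtr split label like "{five} pr_num1 tb0 type0, value>9".
# Field names are guesses based on the multi-hash components mentioned above,
# NOT an official CatBoost format. Assumes the label matches this layout.
def parse_ctr_split(label):
    m = re.match(r'\{(?P<feature>[^}]+)\}\s+pr_num(?P<prior>\d+)\s+'
                 r'tb(?P<target_border>\d+)\s+type(?P<ctr_type>\d+),\s*'
                 r'value>(?P<border>\d+)', label)
    return {k: (v if k == 'feature' else int(v)) for k, v in m.groupdict().items()}

print(parse_ctr_split('{five} pr_num1 tb0 type0, value>9'))
# {'feature': 'five', 'prior': 1, 'target_border': 0, 'ctr_type': 0, 'border': 9}
```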
Let's first look at the number of unique classes in each of the features.
import pandas as pd
X_train.describe().loc['unique']
one 6
two 5
three 8
four 8
five 4
six 6
seven 5
Name: unique, dtype: object
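As a cross-check, the same counts can be computed with nunique; the largest count is then the smallest one_hot_max_size that would one-hot every feature. A self-contained sketch, restating the same X_train data compactly (rows in the same order as above):

```python
import pandas as pd

# The same X_train as above, written compactly (one string per column,
# characters in the same row order as the original DataFrame).
X_train = pd.DataFrame({
    'one':   list('ebvteehsss'),
    'two':   list('aaencanehe'),
    'three': list('slnchrires'),
    'four':  list('dergorpvri'),
    'five':  list('fbobiiiibi'),
    'six':   list('psllnnnleu'),
    'seven': list('ssseggsetr'),
})

n_unique = X_train.nunique()
print(n_unique)
# The smallest one_hot_max_size that one-hot encodes *every* feature
# is the largest unique-class count across features.
print(int(n_unique.max()))  # 8
```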
As you can see, the minimum number of unique classes is 4 (in the feature called "five") and the maximum is 8. Let's set one_hot_max_size=4.
cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0, one_hot_max_size=4).fit(pool)
model.plot_tree(tree_idx=1,pool=pool)
The feature "five" is now a OneHotFeature and results in a split description of five, value=i. However, the feature "one" is still an OnlineCtr.
Let's now set one_hot_max_size=8, which is the maximum possible number of unique classes. This ensures that every feature is a OneHotFeature and none is an OnlineCtr.
cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0, one_hot_max_size=8).fit(pool)
model.plot_tree(tree_idx=1,pool=pool)
Hope this clarifies why Sex from titanic is displayed in a different manner compared to the features you are working with.
For more reading on this, check these links:
Upvotes: 8