Reputation: 21233
Consider the following data:
import pandas as pd
y_train = pd.DataFrame({0: {14194: 'Fake', 13891: 'Fake', 13247: 'Fake', 11236: 'Fake', 2716: 'Real', 2705: 'Real', 16133: 'Fake', 7652: 'Real', 7725: 'Real', 16183: 'Fake'}})
X_train = pd.DataFrame({'one': {14194: 'e',
13891: 'b',
13247: 'v',
11236: 't',
2716: 'e',
2705: 'e',
16133: 'h',
7652: 's',
7725: 's',
16183: 's'},
'two': {14194: 'a',
13891: 'a',
13247: 'e',
11236: 'n',
2716: 'c',
2705: 'a',
16133: 'n',
7652: 'e',
7725: 'h',
16183: 'e'},
'three': {14194: 's',
13891: 'l',
13247: 'n',
11236: 'c',
2716: 'h',
2705: 'r',
16133: 'i',
7652: 'r',
7725: 'e',
16183: 's'},
'four': {14194: 'd',
13891: 'e',
13247: 'r',
11236: 'g',
2716: 'o',
2705: 'r',
16133: 'p',
7652: 'v',
7725: 'r',
16183: 'i'},
'five': {14194: 'f',
13891: 'b',
13247: 'o',
11236: 'b',
2716: 'i',
2705: 'i',
16133: 'i',
7652: 'i',
7725: 'b',
16183: 'i'},
'six': {14194: 'p',
13891: 's',
13247: 'l',
11236: 'l',
2716: 'n',
2705: 'n',
16133: 'n',
7652: 'l',
7725: 'e',
16183: 'u'},
'seven': {14194: 's',
13891: 's',
13247: 's',
11236: 'e',
2716: 'g',
2705: 'g',
16133: 's',
7652: 'e',
7725: 't',
16183: 'r'}})
and the following code:
from catboost import CatBoostClassifier
from catboost import Pool
cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0).fit(pool)
model.plot_tree(
tree_idx=1,
pool=pool  # "pool" is a required parameter for trees with one-hot features
)
I get the following:
But I don't understand what {five} pr_num0 tb0 type0, value>8 means. I was hoping it would look like the titanic example from the manual, which is:
import catboost
import numpy as np
from catboost import CatBoostClassifier, Pool
from catboost.datasets import titanic
titanic_df = titanic()
X = titanic_df[0].drop('Survived',axis=1)
y = titanic_df[0].Survived
is_cat = (X.dtypes != float)
for feature, feat_is_cat in is_cat.to_dict().items():
if feat_is_cat:
X[feature].fillna("NAN", inplace=True)
cat_features_index = np.where(is_cat)[0]
pool = Pool(X, y, cat_features=cat_features_index, feature_names=list(X.columns))
model = CatBoostClassifier(
max_depth=2, verbose=False, max_ctr_complexity=1, iterations=2).fit(pool)
model.plot_tree(
tree_idx=0,
pool=pool
)
This gives:
How can I get the equivalent of Sex, value = Female for my example? That would be, for example, one, value = b.
Upvotes: 4
Views: 3129
Reputation: 19322
TLDR; This is not really a visualization problem but more a question of how a feature split is done in CatBoost.
CatBoost decides which features to one-hot encode and which to ctr encode based on a parameter called one_hot_max_size. If the number of unique classes in a feature is <= one_hot_max_size, the feature is treated as one-hot. By default it is set to 2, so only binary features (0/1, or male/female) are encoded as one-hot, and the others (such as Pclass -> 1, 2, 3) are handled as ctr. Setting it high enough lets you force CatBoost to encode your columns as one-hot.
The {five} pr_num0 tb0 type0, value>8 is basically a label, value pair for a ctr split. There is no documentation for this, but after inspecting the GitHub repo, the label appears to be generated using a multi-hash. More details below.
A feature-split pair is chosen for each leaf. There are three types of splits: FloatFeature, OneHotFeature and OnlineCtr, based on the encoding that is done on the features:

1. FloatFeature: A split on a float feature is represented by the feature index and a border value:
9, border<257.23 #feature index, border value

2. OneHotFeature: A one-hot feature takes a max of n possible values (0 or 1 after encoding). The n is decided by a parameter called one_hot_max_size, which is set to 2 by default. Note that in the titanic dataset, Sex has only 2 possible values, Male or Female. If you set one_hot_max_size=4 then CatBoost one-hot encodes features with up to 4 unique classes (e.g. Pclass in titanic has 3 unique classes). A one-hot split is represented by the feature name and its value:
Sex, value=Female #feature name, value

3. OnlineCtr: If a feature has more unique classes than one_hot_max_size, CatBoost automatically uses ctr to encode it, and the split type is an OnlineCtr. It is represented by the feature name, some dummy tokens that represent the ctr components, and a value:
{five} pr_num1 tb0 type0, value>9 #Label, value
##Inspecting github, the label seems to be from a multihash
##The multihash seems to be made from (CatFeatureIdx, CtrIdx, TargetBorderIdx, PriorIdx)
##https://github.com/catboost/catboost/blob/master/catboost/libs/data/ctrs.h
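To make the label structure concrete, here is a small parser for such split descriptions. The field names below follow the (prior index, target border index, ctr type) guess above; they are not a documented CatBoost format, just a reading of the string layout:

```python
import re

# Parse an OnlineCtr split label like "{five} pr_num1 tb0 type0, value>9".
# Field names are guesses based on the multi-hash components mentioned above,
# NOT an official CatBoost format. Assumes the label matches this layout.
def parse_ctr_split(label):
    m = re.match(r'\{(?P<feature>[^}]+)\}\s+pr_num(?P<prior>\d+)\s+'
                 r'tb(?P<target_border>\d+)\s+type(?P<ctr_type>\d+),\s*'
                 r'value>(?P<border>\d+)', label)
    return {k: (v if k == 'feature' else int(v)) for k, v in m.groupdict().items()}

print(parse_ctr_split('{five} pr_num1 tb0 type0, value>9'))
# {'feature': 'five', 'prior': 1, 'target_border': 0, 'ctr_type': 0, 'border': 9}
```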
Let's first look at the number of unique classes in each of the features.
import pandas as pd
X_train.describe().loc['unique']
one 6
two 5
three 8
four 8
five 4
six 6
seven 5
Name: unique, dtype: object
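As a cross-check, the same counts can be computed with nunique; the largest count is then the smallest one_hot_max_size that would one-hot every feature. A self-contained sketch, restating the same X_train data compactly (rows in the same order as above):

```python
import pandas as pd

# The same X_train as above, written compactly (one string per column,
# characters in the same row order as the original DataFrame).
X_train = pd.DataFrame({
    'one':   list('ebvteehsss'),
    'two':   list('aaencanehe'),
    'three': list('slnchrires'),
    'four':  list('dergorpvri'),
    'five':  list('fbobiiiibi'),
    'six':   list('psllnnnleu'),
    'seven': list('ssseggsetr'),
})

n_unique = X_train.nunique()
print(n_unique)
# The smallest one_hot_max_size that one-hot encodes *every* feature
# is the largest unique-class count across features.
print(int(n_unique.max()))  # 8
```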
As you can see, the minimum number of unique classes is 4 (in the feature called "five") and the maximum is 8. Let's set one_hot_max_size=4.
cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0, one_hot_max_size=4).fit(pool)
model.plot_tree(tree_idx=1,pool=pool)
The feature "five" is now a OneHotFeature and results in a split description of five, value=i. However, the feature "one" is still an OnlineCtr.
Let's now set one_hot_max_size=8, which is the maximum possible number of unique classes. This ensures that every feature is a OneHotFeature and none is an OnlineCtr.
cat_features = list(X_train.columns)
pool = Pool(X_train, y_train, cat_features=list(range(7)), feature_names=cat_features)
model = CatBoostClassifier(verbose=0, one_hot_max_size=8).fit(pool)
model.plot_tree(tree_idx=1,pool=pool)
Hope this clarifies why Sex from titanic is displayed in a different manner compared to the features you are working with.
For more reading on this, check these links:
Upvotes: 8