Q.H.

Reputation: 1456

What do the values that `graphviz` renders inside each node of a decision tree mean?

[image: subtree rendered by graphviz, showing nodes with "F5 <= 0.5", "gini", "samples", and "value" fields]

For the image above: using scikit-learn's AdaBoostClassifier and graphviz I was able to create this subtree visual, and I need help interpreting the values in each node. For example, what does "gini" mean? What is the significance of the "samples" and "value" fields? What does it mean that F5 <= 0.5?

Here is my code (I did this all in jupyter notebook):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

f = open('dtree-data.txt')
d = dict()
for i in range(1,9):
    key = 'F' + str(i)
    d[key] = []
d['RES'] = []
for line in f:
    values = [(True if x == 'True' else False) for x in line.split()[:8]]
    result = line.split()[8]
    d['RES'].append(result)
    for i in range(1, 9):
        key = 'F' + str(i)
        d[key].append(values[i-1])
df = pd.DataFrame(data=d, columns=['F1','F2','F3','F4','F5','F6','F7','F8','RES'])

from sklearn.model_selection import train_test_split

X = df.drop('RES', axis=1)
y = df['RES']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)

from IPython.display import Image
from io import StringIO  # sklearn.externals.six was removed in recent scikit-learn versions
from sklearn.tree import export_graphviz
import pydot

# https://stackoverflow.com/questions/46192063/not-fitted-error-when-using-sklearns-graphviz 

sub_tree = ada.estimators_[0]
dot_data = StringIO()
features = list(df.columns[:-1])  # the eight feature columns; [1:] would drop F1 and include RES
export_graphviz(sub_tree, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())  
Image(graph[0].create_png())

NOTE: External packages may need to be installed in order to view the data locally (obviously)

Here is a link to the data file: https://cs.rit.edu/~jro/courses/intelSys/dtree-data

Upvotes: 8

Views: 7462

Answers (1)

MB-F

Reputation: 23637

A decision tree is a binary tree where each node represents a portion of the data. Each node that is not a leaf (i.e. the root or a branch node) splits its part of the data into two sub-parts. The root node contains all the data (from the training set). Furthermore, this is a classification tree: it predicts class probabilities, which are the node values.
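The fields that graphviz renders can also be read directly off the fitted tree object. A minimal sketch on toy data (not the question's dataset), using the `tree_` attribute that scikit-learn decision trees expose:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy boolean data (not the question's dataset): the class mostly
# follows the first feature, with some label noise.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 2))
y = X[:, 0] ^ (rng.rand(100) < 0.2).astype(int)

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
t = clf.tree_

# These are the same quantities graphviz renders in each node:
print(t.feature[0], t.threshold[0])  # split: feature index and threshold
print(t.n_node_samples)              # 'samples' per node (root, left, right)
print(t.impurity)                    # 'gini' per node
print(t.value)                       # 'value' per node
```

Node 0 is the root; the children's sample counts sum to the root's count, just as in the rendered tree.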

Root/branch node:

  • samples = 134 means that the node 'contains' 134 samples. Since it is the root node, the tree was trained on 134 samples.
  • value = [0.373, 0.627] are the class frequencies. About 1/3 of the samples belong to class A and 2/3 to class B.
  • gini = 0.468 is the gini impurity of the node. It describes how much the classes are mixed up.
  • F5 <= 0.5: F5 is one of the feature columns. The node is split so that all samples where feature F5 is lower than or equal to 0.5 go to the left child, and the samples where it is higher than 0.5 go to the right child.
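As a sanity check, the gini value can be recomputed from the node's value field, since gini impurity is 1 minus the sum of squared class probabilities. A small sketch:

```python
def gini(value):
    """Gini impurity: 1 - sum of squared class probabilities.

    The entries are normalized first, since in boosted trees the
    'value' vector may hold sample weights rather than probabilities.
    """
    total = sum(value)
    return 1.0 - sum((v / total) ** 2 for v in value)

# Root node from the tree above:
print(round(gini([0.373, 0.627]), 3))  # -> 0.468, matching the rendered node
```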

Leaf nodes:

  • These nodes are not split further, so there is no need for an "F <= something" field.
  • samples = 90 / 44 sum to 134. 90 samples went to the left child and 44 samples to the right child.
  • value = [0.104, 0.567] / [0.269, 0.06] are the class frequencies in the children. Most samples in the left child belong to class B (56% vs 10%) and most samples in the right child belong to class A (27% vs 6%).
  • gini = 0.263 / 0.298 are the remaining impurities in the child nodes. They are lower than in the parent node, which means the split improved separability between the classes, but there is still some uncertainty left.
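The same computation (1 minus the sum of squared, normalized class probabilities) reproduces the leaf impurities up to rounding of the displayed value vectors:

```python
def gini(value):
    """Gini impurity from a node's value field, normalized first."""
    total = sum(value)
    return 1.0 - sum((v / total) ** 2 for v in value)

left = gini([0.104, 0.567])   # close to the rendered 0.263
right = gini([0.269, 0.06])   # close to the rendered 0.298
print(round(left, 3), round(right, 3))

# The split is useful because the weighted average of the child
# impurities is well below the parent's gini of 0.468:
print(round((90 * left + 44 * right) / 134, 3))
```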

Upvotes: 18
