Igor L.

Reputation: 3445

What is the meaning of the value property in a generated scikit learn decision tree?

I am following the excellent talk on Pandas and Scikit learn given by Skipper Seabold.

I am using his cleaned data set, which originates from the UCI Adult data set.

After running the code below and rendering the tree with graphviz, every node in the tree shows a value entry.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

dta = pd.read_csv("data/adult.data.cleaned.csv")

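# Replace "?" placeholders in string (object) columns with "Other".
# NOTE: `test` (the held-out data) is not loaded in this snippet and must be defined first.
for col in dta:
    if dta[col].dtype.kind != "O":
        continue
    mask = dta[col].str.contains(r"\?")
    if mask.any():
        # .loc replaces the deprecated .ix indexer
        dta.loc[mask, col] = "Other"
        test.loc[test[col].str.contains(r"\?"), col] = "Other"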

dta.income.replace({"<=50K": 0, ">50K": 1}, inplace=True)
test.income.replace({"<=50K": 0, ">50K": 1}, inplace=True)

y = dta.pop("income")
y_test = test.pop("income")

X_train = pd.get_dummies(dta)
X_test = pd.get_dummies(test)

X_test[X_train.columns.difference(X_test.columns)[0]] = 0

dtree = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=6)
dtree.fit(X_train, y)
export_graphviz(dtree, feature_names=X_train.columns)

What do the value properties represent? EDIT: To be specific, every node has a value=[x, y] property.

Final decision tree

Upvotes: 6

Views: 7554

Answers (1)

skrubber

Reputation: 1095

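value is the number of training samples of each class that reach the node, listed in the order of dtree.classes_. At the root node, the 32561 samples consist of 24720 with income <=50K (class 0) and 7841 with income >50K (class 1), so value = [24720, 7841]. These per-class counts are what the entropy, and hence the information gain of a candidate split, is computed from.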
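To check this on a fitted tree, here is a minimal sketch. It uses a synthetic data set from make_classification as a stand-in (an assumption, since the cleaned adult CSV is not available here) and compares the root node's value with the class counts in y. The names X, y, clf, root_counts and root_fractions are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the adult data set (assumption).
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.75, 0.25], random_state=0
)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X, y)

# Per-class sample counts in the training labels, in the order of clf.classes_.
root_counts = np.bincount(y)

# tree_.value has shape (n_nodes, n_outputs, n_classes); node 0 is the root.
root_value = clf.tree_.value[0, 0]

# Normalize so the comparison holds whether value stores counts or fractions.
root_fractions = root_value / root_value.sum()

print("classes:           ", clf.classes_)
print("class counts in y: ", root_counts)
print("root class shares: ", root_fractions)
print("samples at root:   ", clf.tree_.n_node_samples[0])
```

The root's class shares should match the class shares in y. Depending on the scikit-learn version, tree_.value may hold class fractions rather than raw counts, which is why the sketch normalizes before comparing.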

Nice explanation from S. Raschka here

Upvotes: 5
