Reputation: 3445
I am following the excellent talk on pandas and scikit-learn given by Skipper Seabold.
I am using his cleaned data set, which originates from the UCI Adult data set.
After running the code below and rendering the tree image with Graphviz, I can see that every node in the tree carries a value entry.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

dta = pd.read_csv("data/adult.data.cleaned.csv")
# NOTE: `test` is the matching test set; the line that loads it is not shown here.

# Replace "?" placeholders in string columns with "Other".
for col in dta:
    if not dta[col].dtype.kind == "O":
        continue
    if dta[col].str.contains(r"\?").any():
        dta.loc[dta[col].str.contains(r"\?"), col] = "Other"
        test.loc[test[col].str.contains(r"\?"), col] = "Other"

# Encode the target as 0/1 and split it off from the features.
dta["income"] = dta["income"].replace({"<=50K": 0, ">50K": 1})
test["income"] = test["income"].replace({"<=50K": 0, ">50K": 1})
y = dta.pop("income")
y_test = test.pop("income")

# One-hot encode the categorical columns.
X_train = pd.get_dummies(dta)
X_test = pd.get_dummies(test)
# Add the one dummy column that is missing from the test set.
X_test[X_train.columns.difference(X_test.columns)[0]] = 0

dtree = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=6)
dtree.fit(X_train, y)
export_graphviz(dtree, feature_names=X_train.columns)
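As an aside, the line that adds the missing dummy column to X_test only handles a single column and does not fix the column order. A more general way to align the two frames might look like the sketch below. The toy frames and the names train_df and test_df are made up for illustration and are not part of the real data.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train/test data.
train_df = pd.DataFrame(
    {"workclass": ["Private", "State-gov", "Other"], "age": [39, 50, 28]}
)
test_df = pd.DataFrame({"workclass": ["Private", "Private"], "age": [25, 44]})

X_train = pd.get_dummies(train_df)
X_test = pd.get_dummies(test_df)

# Reindex the test columns against the training columns:
# missing dummy columns are created and filled with 0,
# columns unknown to the training set are dropped,
# and the column order matches X_train.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(X_test.columns.tolist())
```

This keeps the test matrix in the same shape and order that the fitted tree expects, however many categories happen to be absent from the test set.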
What do the value properties represent?
EDIT: To clarify, every node in the rendered tree has a value=[x, y] property.
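One way to investigate this is to look at the fitted tree directly. The sketch below uses a synthetic data set from make_classification (an assumption, not the Adult data) and compares the root node's value with the class distribution of the training labels. As far as I can tell, value holds the per-class tally of the training samples that reach a node, stored as counts or as proportions depending on the scikit-learn version, so the sketch normalizes before comparing.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0, max_depth=2)
clf.fit(X, y)

# tree_.value has shape (n_nodes, n_outputs, n_classes); node 0 is the root.
root_value = clf.tree_.value[0, 0]

# Normalize so the comparison holds whether the installed version
# stores raw counts or proportions.
root_share = root_value / root_value.sum()
class_share = np.bincount(y) / len(y)

print("root value: ", root_value)
print("root share: ", root_share)
print("class share:", class_share)
```

If the two shares agree, then x and y in value=[x, y] would correspond to class 0 and class 1 (here <=50K and >50K), in the order given by the classifier's classes_ attribute.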
Upvotes: 6
Views: 7554