Reputation: 195
The values in the value attribute of the decision tree classifier stubs used with an AdaBoostClassifier do not match my expectations, and I cannot determine what the values indicate. I would like to understand them so I can analyze the behavior of the stub estimators and the contribution each stub makes to the AdaBoostClassifier. Similar questions on Stack Overflow do not match my situation.
Version information
The DecisionTreeClassifier stubs are configured as:
number_estimators = 301
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME.R", n_estimators=number_estimators)
The AdaBoostClassifier is a binary classifier with output states Class A and Class B (encoded as +1 and -1). The training set consists of 23 features, and the classifier performs reasonably well (prediction accuracy, precision, and recall are all approximately 79%). I am analyzing missed predictions to get some insight into the classification errors.
There are 782 training samples. The 301 stub estimators are obtained from the AdaBoostClassifier via:
tree_stubs = bdt.estimators_
An example stub corresponding to the 6th estimator (0-based indexing):
bdt.estimators_[5]
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=421257592, splitter='best')
The values for this stub:
stub_5 = bdt.estimators_[5]
stub_5.tree_.value
array([[[0.5 , 0.5 ]],
[[0.29308331, 0.1861591 ]],
[[0.20691669, 0.3138409 ]]])
For those familiar with graphviz, the rendered tree stub shows the following. The root node correctly displays the number of samples (782), and its value attribute shows [0.5, 0.5]. I was expecting the value attribute to hold the number of samples in each class, not a percentage. That said, in the root node the 0.5 values do reflect my balanced training set, with equal representation of the two classes.
Now for the problem. The splitting feature in this stub divides the samples based on whether the delta_win_pct value is less than or equal to a threshold of -0.001. My data set does indeed have 385 sample records where delta_win_pct is less than or equal to this threshold and 397 samples where delta_win_pct is greater than the threshold. So the sample counts in the left and right leaf nodes of the tree stub are correct.
But the value data appears to be incorrect. In the left child node the values are reported as value = [0.293, 0.186], and in the right child node value = [0.207, 0.314]. Note, this data is reported by the sklearn.tree._tree.Tree class and is not indicative of any problem with graphviz.
What do these value quantities represent?
Considering the left leaf node, my data set actually has 264 Class A samples whose delta_win_pct <= -0.001 and 121 Class B samples matching this splitting threshold. These counts correspond to proportions of [0.6857, 0.3143], not [0.293, 0.186], and the reported values do not even scale linearly to the expected ones.
Similarly, for the right child node the value data is given as [0.207, 0.314], but the expected values should be [0.330, 0.670] for the 397 samples whose delta_win_pct exceeds the threshold.
I notice that the numbers in the provided value data (0.293, 0.186, 0.207, 0.314) add up to 1.0 across all four entries, but they do not add up to 1.0 within each node. I also tried treating the provided values as fractions of all the samples, e.g. 0.293 * 782 = 229, which doesn't correspond to anything.
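For concreteness, here is the arithmetic behind the expected left-node proportions quoted above (the counts come from my data set):
# Left child: 264 Class A and 121 Class B samples with delta_win_pct <= -0.001
left_a, left_b = 264, 121
left_total = left_a + left_b                      # 385 samples
print(left_a / left_total, left_b / left_total)   # -> 0.6857..., 0.3142...

# Neither matches the reported row [0.293, 0.186], and scaling by all 782
# samples doesn't help either:
print(0.293 * 782)                                # -> ~229, which corresponds to nothing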
Does anyone have any insight into what the provided value data means? Is my interpretation and expectation of these values incorrect?
Finally, I notice that the relative magnitudes of the values do correctly indicate the majority class in each node. In the left child node 0.293 > 0.186, indicating that the left node has a majority of Class A samples, while in the right leaf node 0.207 < 0.314, indicating a majority of Class B samples when delta_win_pct exceeds the threshold. I suspect this is why the AdaBoostClassifier appears to be working.
In any case, I'd like to understand what these value entries represent.
Upvotes: 3
Views: 2019
Reputation: 11
In AdaBoost every data point is assigned a weight. Initially, all the weights are equal (1/the total number of samples). In AdaBoost the trees are trained sequentially. After we train the first tree, the weight of each data point is adjusted depending on the errors that the first tree made. So, when we start training the second tree the weights of the data points are different. So, value in this case represents the sum of the weights of the data points per class.
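To illustrate this claim, here is a minimal sketch on synthetic data (not the poster's data set). It assumes a scikit-learn version like the one in the question, where tree_.value stores weighted class counts; very recent releases may normalize each node's row instead. Fitting a stump directly with non-uniform sample weights shows tree_.value holding the per-class sums of those weights:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=3, n_informative=2,
                           n_redundant=1, n_classes=2, random_state=42)

# Non-uniform weights, like those AdaBoost produces after a few boosting rounds
rng = np.random.RandomState(0)
w = rng.rand(len(y))
w /= w.sum()                      # AdaBoost keeps the sample weights summing to 1

stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)

# Per-class sums of the sample weights over the whole training set ...
print(w[y == 0].sum(), w[y == 1].sum())
# ... appear in the root node's row of tree_.value (in older scikit-learn versions)
print(stump.tree_.value[0])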
Upvotes: 1
Reputation: 1
Somehow the values array represents the expected outcome only when the problem is not rebalanced. When the model parameter class_weight is NOT set to 'balanced', the values give the proportion of Class A and Class B within that node; but when class_weight='balanced' is set, the values give unexpected output.
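A quick way to see the effect this answer describes (a sketch on synthetic data; the dataset and parameters are arbitrary):
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=3, n_informative=2,
                           n_redundant=1, n_classes=2, random_state=42)

# Without class_weight, the value rows reflect the per-node class counts
# (in older scikit-learn releases; newer ones may normalize each row)
plain = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(plain.tree_.value)

# With class_weight='balanced', each sample is reweighted by
# n_samples / (n_classes * count of its class), so the rows no longer
# match the raw per-node class counts
balanced = DecisionTreeClassifier(max_depth=1, class_weight='balanced').fit(X, y)
print(balanced.tree_.value)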
Upvotes: 0
Reputation: 5896
I tried reproducing it on a generated dataset:
import pydot
import numpy as np
from io import StringIO  # sklearn.externals.six is removed in recent scikit-learn
from IPython.display import Image, display
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Balanced two-class toy dataset with 3 features
X, y = make_classification(n_informative=2, n_features=3, n_samples=200,
                           n_redundant=1, random_state=42, n_classes=2)
feature_names = ['X0', 'X1', 'X2']

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME.R", n_estimators=301)
clf.fit(X, y)

estimator = clf.estimators_[0]
dot_data = StringIO()
export_graphviz(estimator, out_file=dot_data, feature_names=feature_names,
                proportion=False, filled=True, node_ids=True, rounded=True,
                class_names=['0', '1'])
graph = pydot.graph_from_dot_data(dot_data.getvalue())[0]

def viewPydot(pdot):
    plt = Image(pdot.create_png())
    display(plt)

viewPydot(graph)
I found that there are two cases: a "proper" one (clf.estimators_[0]) which looks like this.
Here value stands for the proportion of a particular class in a node relative to the total number of samples, so node #1: [84/200 = 0.42, 7/200 = 0.035] and node #2: [16/200 = 0.08, 93/200 = 0.465]. If you set the proportion parameter to True you will get the class distribution of each node as percentages, e.g. for node #2: [16/109, 93/109] = [0.147, 0.853]. It is calculated using the weighted_n_node_samples attribute, which in the proper case equals the number of samples in a node divided by the total number of samples, e.g. 109/200 = 0.545, and [0.08, 0.465] / 0.545 = [0.147, 0.853].
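As a quick numerical check (a sketch reusing clf from the snippet above, and assuming a scikit-learn version where tree_.value stores weighted class counts, as in the question), the per-node distribution that proportion=True prints can be recomputed directly from the tree arrays:
t = clf.estimators_[0].tree_

# value has shape (n_nodes, n_outputs, n_classes) and here holds the per-class
# weighted sample counts of each node; dividing each row by that node's
# weighted_n_node_samples yields the class distribution shown with proportion=True.
per_node_distribution = t.value[:, 0, :] / t.weighted_n_node_samples[:, None]
print(per_node_distribution)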
Another case (clf.estimators_[4]) is the one you encountered:
Left node classes: [74, 7]
Right node classes: [93, 26]
The class distribution here does not correlate with value; the left node even predicts the minority class (see the quick check below).
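A small sketch of how those per-node class counts can be tallied and compared against the value rows (reusing clf, X and y from the code above; estimator index 4 is just the example discussed here):
import numpy as np

est = clf.estimators_[4]            # the stump whose value rows look "off"
leaf_ids = est.apply(X)             # node index each training sample ends up in

for node in np.unique(leaf_ids):
    raw_counts = np.bincount(y[leaf_ids == node], minlength=2)
    print(node, raw_counts, est.tree_.value[node])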
The only proper case seems to be the first estimator; the others have this problem, so maybe it is part of the boosting procedure? Also, if you take any of the estimator trees and fit it manually on the unweighted data, you get the same numbers as in the first one, e.g.
>>> DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=441365315, splitter='best').fit(X,y).tree_.value
array([[[100., 100.]],
[[ 84., 7.]],
[[ 16., 93.]]])
Upvotes: 2