Decision trees. Choosing thresholds to split objects

Question

If I understand correctly, we are given a set of objects (each an array of features) and need to split it into two subsets. To do that, we compare some feature x_j to a threshold t_m (the threshold at node m), and we use an impurity function H() to find the best split. But how do we choose the value of t_m, and which feature should be compared to the threshold? There are infinitely many possible values of t_m, so we can't just compute H() for each one.

geledek · Accepted Answer

On page 18 of these slides, two methods are introduced for choosing the splitting threshold for a numerical attribute X.

Method 1:

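Sort the data according to X into {x_1, ..., x_m}
Consider split points of the form x_i + (x_{i+1} - x_i)/2, i.e. the midpoints between consecutive sorted values. Any threshold strictly between x_i and x_{i+1} produces the same partition of the data, so at most m - 1 candidates need to be evaluated, not infinitely many.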
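
To make Method 1 concrete, here is a minimal Python sketch of the candidate generation. The function name `candidate_thresholds` is hypothetical, and dropping duplicate values before taking midpoints is my own assumption (the slide only gives the midpoint formula); duplicates would just add redundant candidates.

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct sorted values.

    Assumption: duplicates are removed first, since repeated values
    would not produce any new partition of the data.
    """
    xs = sorted(set(values))
    # Midpoint written in the same form as in the answer: x_i + (x_{i+1} - x_i)/2
    return [a + (b - a) / 2 for a, b in zip(xs, xs[1:])]


print(candidate_thresholds([2.0, 5.0, 3.0, 3.0, 9.0]))  # [2.5, 4.0, 7.0]
```

So for a feature with k distinct values there are only k - 1 thresholds worth trying.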

Method 2:

Suppose X is a real-valued variable

Define IG(Y|X:t) as H(Y) - H(Y|X:t)
Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
- IG(Y|X:t) is the information gain for predicting Y if all you know is whether X is less than t or at least t
Then define IG*(Y|X) = max_t IG(Y|X:t)
For each real-valued attribute, use IG*(Y|X) for assessing its suitability as a split

Note that the tree may split on the same attribute multiple times, with different thresholds.
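
The definitions in Method 2 can be sketched in Python as follows. This is only an illustration under some assumptions: H is taken to be Shannon entropy in bits, the maximum over t is taken over the midpoint candidates from Method 1, and all function names (`entropy`, `info_gain`, `best_threshold`, `best_split`) are hypothetical rather than taken from the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y) in bits (assumed impurity function)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(xs, ys, t):
    """IG(Y|X:t) = H(Y) - [H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)]."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(ys)
    cond = entropy(left) * len(left) / n + entropy(right) * len(right) / n
    return entropy(ys) - cond


def best_threshold(xs, ys):
    """Return (IG*(Y|X), t), maximising over midpoint candidates only."""
    vals = sorted(set(xs))
    cands = [a + (b - a) / 2 for a, b in zip(vals, vals[1:])]
    if not cands:
        return 0.0, None  # constant feature: nothing to split on
    return max((info_gain(xs, ys, t), t) for t in cands)


def best_split(rows, ys):
    """Return (gain, feature index j, threshold t) with the largest IG*."""
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        gain, t = best_threshold([row[j] for row in rows], ys)
        if t is not None and gain > best[0]:
            best = (gain, j, t)
    return best


rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
ys = ["a", "a", "b", "b"]
print(best_split(rows, ys))  # feature 0 at t = 2.5 separates the classes
```

This also covers the "which feature" part of the question: each feature is scored by its own best threshold, and the node uses the feature and threshold pair with the highest gain.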
