Reputation: 68
I was suddenly confronted with the question of how a computer actually generates a decision tree.
For example, consider the problem of predicting a flower's species using the random forest algorithm, where two attributes of the flower (petal width, petal length) distinguish the species.
In terms of that flower problem, when using scikit-learn's DecisionTreeClassifier(), how does the computer figure out which petal width and petal length thresholds to split on when building the tree? Also, does a random forest algorithm create these decision trees by brute force, testing every single variation of a tree and weighing entropy?
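For concreteness, I mean a setup roughly like this (just my sketch of the usual iris example with DecisionTreeClassifier; the exact call isn't the point):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris data: columns 2 and 3 are petal length and petal width
iris = load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)  # somewhere in here the split thresholds get chosen -- how?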
Upvotes: 0
Views: 77
Reputation: 5355
Let's make it a bit simpler; say you have 2 classes, C1 and C2, and 1 feature X:

X = [0, 1, 2, 3,
     4, 5, 6, 7]
y = ["C1", "C1", "C1", "C1",
     "C2", "C2", "C2", "C2"]

We can easily see that if X < 4 then we output C1, else C2.
So the way we do it is: for each feature, we check "at which splitting point do we best separate the (two) classes?", or equivalently "at which splitting point do we create the purest nodes?" (nodes that mostly contain one class). How we define "purity" is based on e.g. Gini impurity or entropy.
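To make that concrete, here is a small sketch of my own (not sklearn's actual code) that scores every candidate threshold on the toy data above by weighted Gini impurity:

import numpy as np

X = np.array([0, 1, 2, 3, 4, 5, 6, 7])
y = np.array(["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"])

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2); 0 means the node is completely pure
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(threshold):
    # Weighted impurity of the two children produced by the test X < threshold
    left, right = y[X < threshold], y[X >= threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

for t in range(1, 8):
    print(f"X < {t}: weighted Gini = {split_score(t):.3f}")
# The minimum (0.000) is at X < 4, exactly the split we spotted by eye.

Swapping gini() for an entropy function picks the same best split on this data.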
There are ways to optimize how we find those splits, e.g. by sorting the features first (take our example above: there's no need to check the split between 0 and 1, since both of them belong to C1, i.e. the first split worth considering is between 3 and 4, since we have C1 to the left and C2 to the right).
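A minimal sketch of that shortcut (again just my illustration): after sorting by the feature, only thresholds where the class label changes need to be evaluated:

X = [0, 1, 2, 3, 4, 5, 6, 7]
y = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]

# Sort by the feature value (already sorted here, but done for generality)
pairs = sorted(zip(X, y))

# Only thresholds where the label changes between neighbours are worth testing
candidates = [
    (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint between the two values
    for i in range(len(pairs) - 1)
    if pairs[i][1] != pairs[i + 1][1]
]
print(candidates)  # [3.5] -- the single boundary between C1 and C2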
If we then have 3 classes instead,

X = [0, 1, 2, 3,
     4, 5, 6, 7,
     8, 9, 10]
y = ["C1", "C1", "C1", "C1",
     "C2", "C2", "C2", "C2",
     "C3", "C3", "C3"]
we can write our own "tree"

def predict(X):
    if X < 4:
        return "C1"
    if X < 8:
        return "C2"
    return "C3"

which now looks a bit more like the decision tree we know.
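If you want to check this against the library, a quick sketch (the exact export_text output format is from memory, so treat it as approximate):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[x] for x in range(11)]          # the same single feature, as a column
y = ["C1"] * 4 + ["C2"] * 4 + ["C3"] * 3

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(clf, feature_names=["X"]))
# Expect splits around X <= 3.5 and X <= 7.5, mirroring the hand-written tree above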
Upvotes: 1