Reputation: 17
I have a question about splitting nodes in a decision tree. I have 4 features and want to predict whether a person will play, maybe play, or not play. Based on information gain, Weather is the first feature to split on, which gives me Rainy, Hot and Humid as the branches. Rainy results in a pure Yes prediction; Hot and Humid do not. I am trying to determine which branch (Hot or Humid) I should grow / split next. I know that I can select the next feature by maximum information gain, and the next feature with the highest information gain is Gender. But I don't know whether I should use Hot or Humid to go further down.
Weather
  Rainy: Yes
  Hot
  Humid
Gender  YoungOrOld  Weather  Mood  Play?
Male    0           Hot      Bad   Yes
Male    1           Hot      OK    Yes
Female  1           Hot      OK    Maybe
Female  0           Hot      Bad   Yes
Male    1           Hot      OK    Yes
Male    0           Humid    OK    Yes
Female  1           Humid    OK    Maybe
Female  1           Rainy    Good  No
Male    2           Rainy    OK    No
Female  2           Rainy    Good  No
Upvotes: 0
Views: 97
Reputation: 77910
You have already split on Weather and Gender. Weather = Rainy needs no more splitting; otherwise, Gender = Male needs no more splitting.
The split you propose would be Hot vs Humid, but this doesn't gain anything. Instead, split the remaining Female node on YoungOrOld. Its two '1' entries are Maybe; every other non-Rainy entry is Yes. Now all nodes are pure.
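To check the purity claim, here is a minimal Python sketch that assigns each row from the question to the leaf it would land in under these splits and counts the labels per leaf. The helper name `leaf_for` and the leaf names are made up for illustration. The rows are copied from the table as posted, where the Rainy rows are labelled No; the Rainy node is pure either way.

```python
from collections import Counter

# Rows copied from the question: (Gender, YoungOrOld, Weather, Mood, Play?)
ROWS = [
    ("Male", 0, "Hot", "Bad", "Yes"),
    ("Male", 1, "Hot", "OK", "Yes"),
    ("Female", 1, "Hot", "OK", "Maybe"),
    ("Female", 0, "Hot", "Bad", "Yes"),
    ("Male", 1, "Hot", "OK", "Yes"),
    ("Male", 0, "Humid", "OK", "Yes"),
    ("Female", 1, "Humid", "OK", "Maybe"),
    ("Female", 1, "Rainy", "Good", "No"),
    ("Male", 2, "Rainy", "OK", "No"),
    ("Female", 2, "Rainy", "Good", "No"),
]


def leaf_for(row):
    """Hypothetical helper: name the leaf a row lands in under the splits above."""
    gender, young_or_old, weather, _mood, _play = row
    if weather == "Rainy":
        return "Rainy"
    if gender == "Male":
        return "not Rainy, Male"
    return f"not Rainy, Female, YoungOrOld={young_or_old}"


# Count the class labels that end up in each leaf.
leaves = {}
for row in ROWS:
    leaves.setdefault(leaf_for(row), Counter())[row[-1]] += 1

for name, counts in leaves.items():
    status = "pure" if len(counts) == 1 else "impure"
    print(f"{name}: {dict(counts)} ({status})")
```

A leaf is pure when its counter holds a single label, which is the condition under which no further split is needed.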
Upvotes: 0
Reputation: 9410
You have divided the samples of your dataset by the feature "Weather". The node for "Weather=Rainy" is pure, so you don't have to split it any further, unlike the impure nodes for "Weather=Hot" and "Weather=Humid". Because they are impure, by default you should split both of them; there is no need to choose between them. You can also specify your own stopping criterion: besides stopping when a node is pure, you can set a minimum number of samples required to split a node, and then stop splitting a node not only when it is pure but also when it contains too few samples.
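As a rough illustration, the sketch below grows a tree recursively in Python, picking the feature with the highest information gain separately inside every impure node. It is an ID3-style sketch under assumptions, not a reference implementation: the names `entropy`, `info_gain`, `build` and the `min_samples_split` parameter are chosen for this example, ties in gain go to the first feature listed, and the "Play?" column is stored under the key `Play`.

```python
import math
from collections import Counter

FEATURES = ["Gender", "YoungOrOld", "Weather", "Mood"]
COLUMNS = FEATURES + ["Play"]

# Rows copied from the question, in the order of COLUMNS.
RAW = [
    ("Male", 0, "Hot", "Bad", "Yes"),
    ("Male", 1, "Hot", "OK", "Yes"),
    ("Female", 1, "Hot", "OK", "Maybe"),
    ("Female", 0, "Hot", "Bad", "Yes"),
    ("Male", 1, "Hot", "OK", "Yes"),
    ("Male", 0, "Humid", "OK", "Yes"),
    ("Female", 1, "Humid", "OK", "Maybe"),
    ("Female", 1, "Rainy", "Good", "No"),
    ("Male", 2, "Rainy", "OK", "No"),
    ("Female", 2, "Rainy", "Good", "No"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW]


def entropy(rows):
    """Shannon entropy (in bits) of the Play labels in rows."""
    counts = Counter(r["Play"] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def partition(rows, feature):
    """Group rows by their value of the given feature."""
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r)
    return groups


def info_gain(rows, feature):
    """Parent entropy minus the size-weighted entropy of the children."""
    groups = partition(rows, feature).values()
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups)
    return entropy(rows) - remainder


def build(rows, features, min_samples_split=2):
    """Return a nested dict for internal nodes and a label string for leaves."""
    labels = Counter(r["Play"] for r in rows)
    majority = labels.most_common(1)[0][0]
    # Stopping criteria: pure node, too few samples, or no features left.
    if len(labels) == 1 or len(rows) < min_samples_split or not features:
        return majority
    best = max(features, key=lambda f: info_gain(rows, f))
    if info_gain(rows, best) < 1e-12:  # no feature separates the labels
        return majority
    remaining = [f for f in features if f != best]
    # Every child is grown independently; none is preferred over another.
    return {best: {value: build(group, remaining, min_samples_split)
                   for value, group in partition(rows, best).items()}}


tree = build(ROWS, FEATURES)
print(tree)
```

With the default `min_samples_split=2`, both the Hot and the Humid branch are split further, each with whichever feature scores best on its own rows, so the two branches need not use the same feature. Calling `build(ROWS, FEATURES, min_samples_split=3)` instead turns the two-row Humid node into a leaf even though it is impure.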
Upvotes: 1