ribs2spare

Reputation: 241

Calculating the entropy of a specific attribute?

This is super simple but I'm learning about decision trees and the ID3 algorithm. I found a website that's very helpful, and I was following everything about entropy and information gain until I got to this point on the page.

I don't understand how the entropy for each individual attribute value (sunny, windy, rainy) is calculated, specifically how p_i is calculated. It seems different from the way it is calculated for Entropy(S). Can anyone explain the process behind this calculation?

Upvotes: 1

Views: 4268

Answers (2)

First, calculate the proportion that sunny represents in the set S, i.e., |sunnyInstances| / |S| = 3/10 = 0.3.

Then apply the entropy formula to the sunny subset alone. There are 3 sunny instances divided into 2 classes: 2 sunny instances relate to Tennis and 1 relates to Cinema. So the entropy formula for sunny looks like this: -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.918

And so on for the other attribute values.
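In code, that per-value calculation is just the entropy formula applied to the class counts inside the subset. Here's a minimal Python sketch (the entropy helper name is mine; the counts come from the numbers above):

import math

def entropy(counts):
    # counts: number of instances of each class within the subset
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Sunny subset: 3 instances, 2 labeled Tennis and 1 labeled Cinema
print(entropy([2, 1]))   # ~0.918
# Its weight in the weighted average for the split is |sunny| / |S| = 3/10 = 0.3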

Upvotes: 3

Elliott Addi

Reputation: 380

To split a node into two child nodes, one method consists of splitting on the variable that maximises your information gain. When you reach a pure leaf node, the information gain equals 0 (because you can't gain any information by splitting a node that contains only one class).

In your example, Entropy(S) = 1.571 is your current entropy, the one you have before splitting. Let's call it HBase. Then you compute the entropy for each parameter you could split on. To get your Information Gain, you subtract the weighted entropy of your child nodes from HBase:

gain = HBase - (child1NumRows / numOfRows) * entropyChild1 - (child2NumRows / numOfRows) * entropyChild2

import math

def GetEntropy(dataSet):
    # ResultsCounts(dataSet) is assumed to return a dict mapping each
    # class label to its number of occurrences in dataSet, and
    # NbRows(dataSet) the total number of rows.
    results = ResultsCounts(dataSet)
    h = 0.0   #h => entropy

    for i in results.keys():
        p = float(results[i]) / NbRows(dataSet)
        h = h - p * math.log2(p)
    return h

def GetInformationGain(dataSet, currentH, child1, child2):
    # Weight each child's entropy by its share of the rows
    p = float(NbRows(child1)) / NbRows(dataSet)
    gain = currentH - p * GetEntropy(child1) - (1 - p) * GetEntropy(child2)
    return gain
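As a usage sketch, here is one way to run the two functions end to end. ResultsCounts and NbRows aren't defined in the answer, so the versions below (which treat the last column of each row as the class label) are assumptions, and the toy rows are made up purely for illustration:

def ResultsCounts(dataSet):
    # Assumed helper: count how many rows fall into each class,
    # treating the last column of a row as its class label.
    counts = {}
    for row in dataSet:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return counts

def NbRows(dataSet):
    # Assumed helper: number of rows in the data set
    return len(dataSet)

# Hypothetical split of a toy data set on weather == 'sunny'
child1 = [['sunny', 'Tennis'], ['sunny', 'Tennis'], ['sunny', 'Cinema']]
child2 = [['rainy', 'Cinema'], ['rainy', 'Cinema'], ['windy', 'Tennis']]
dataSet = child1 + child2

currentH = GetEntropy(dataSet)   # 1.0 here (3 Tennis vs 3 Cinema)
print(GetInformationGain(dataSet, currentH, child1, child2))   # ~0.082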

The objective is to choose the split with the best Information Gain!

Hope this helps! :)

Upvotes: 1
