Reputation: 13
So I learned about the concept of information entropy from Khan Academy, where it was phrased as the "average number of yes-or-no questions needed per symbol". They also gave an alternative form using logarithms.
So let's say we have a symbol generator that produces A, B, and C, with P(A) = 1/2, P(B) = 1/3, and P(C) = 1/6. According to their method, I would get a chart like this: First method
Then I would multiply each symbol's probability of occurring by the number of questions it needs, giving (1/2)*1 + (1/3)*2 + (1/6)*2 = 1.5 bits,
but their other method gives -(1/2)log2(1/2) - (1/3)log2(1/3) - (1/6)log2(1/6) = 1.459... bits.
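For reference, both numbers can be reproduced with a few lines of Python (just redoing the arithmetic above; the `questions` counts are read off the decision tree in the chart):

```python
import math

# Symbol probabilities from the question
p = {"A": 1/2, "B": 1/3, "C": 1/6}

# Questions needed per symbol in the decision tree: 1 for A, 2 for B and C
questions = {"A": 1, "B": 2, "C": 2}

# First method: expected number of yes-or-no questions
expected_questions = sum(p[s] * questions[s] for s in p)

# Second method: Shannon entropy, -sum(p * log2(p))
entropy = -sum(pi * math.log2(pi) for pi in p.values())

print(expected_questions)  # 1.5
print(entropy)             # 1.459...
```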
The difference is small, but still significant. I've tried this with other combinations of symbols and probabilities and got similar results. Is there something I'm missing? Am I using one of the methods wrong, or does one of them only apply under certain conditions?
Upvotes: 1
Views: 35
Reputation: 41484
Your second calculation is correct.
The problem with your decision tree approach is that the decision tree is not optimal (and indeed, no binary decision tree could be for those probabilities). Your "is it B?" decision node represents less than one bit of information, because by the time you reach it you already know the answer is probably B: the conditional probabilities are 2/3 for B and 1/3 for C, and an unbalanced question yields less than a full bit. So your decision tree represents a potential encoding of symbols that is expected to consume 1.5 bits on average, but it carries slightly less than 1.5 bits of information.
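To make that concrete (a sketch using the chain rule for entropy; the variable names are mine): score each question by the entropy of its yes/no split. The first question splits 1/2 vs 1/2 and is worth exactly one bit; the second splits the conditional probabilities 2/3 vs 1/3 and is worth less.

```python
import math

def h(*ps):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

first = h(1/2, 1/2)   # "is it A?" splits 1/2 vs 1/2 -> exactly 1 bit
second = h(2/3, 1/3)  # "is it B?" splits 2/3 vs 1/3 -> about 0.918 bits

# The second question is only asked half the time, so the total
# information extracted by the tree is:
info = first + (1/2) * second
print(info)  # 1.459... -- the entropy, not the 1.5 questions asked
```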
In order to have a binary tree which represents an optimal encoding, each question needs to split the remaining probability mass exactly 50/50. This is not possible if some symbol has a probability whose denominator is not a power of 2.
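For example (an illustration with the probabilities changed to powers of 1/2, not the question's original values): if the generator instead had P(A) = 1/2, P(B) = 1/4, P(C) = 1/4, the same tree shape would be balanced at every node, and the two methods would agree exactly.

```python
import math

# Dyadic probabilities: every question splits the remaining mass 50/50
p = {"A": 1/2, "B": 1/4, "C": 1/4}
questions = {"A": 1, "B": 2, "C": 2}  # same tree shape as before

avg_questions = sum(p[s] * questions[s] for s in p)
entropy = -sum(pi * math.log2(pi) for pi in p.values())

print(avg_questions, entropy)  # both exactly 1.5
```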
Upvotes: 1