Reputation: 79268
I have just read this:
This is where a really smart idea called Huffman coding comes in! The idea is that we represent our characters (like a, b, c, d, ….) with codes like
a: 00 b: 010 c: 011 d: 1000 e: 1001 f: 1010 g: 1011 h: 1111
If you look at these carefully, you’ll notice something special! It’s that none of these codes is a prefix of any other code. So if we write down
010001001011
we can see that it’s 010 00 1001 011
or baec! There wasn’t any ambiguity, because 0 and 01 and 0100 don’t mean anything.
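The decoding step the quote describes can be sketched directly. This is a minimal greedy decoder (names are illustrative) using the quoted table: because no code is a prefix of another, reading bits left to right and emitting a character as soon as the buffer matches a code is never ambiguous.

```python
# Code table quoted above.
CODES = {"a": "00", "b": "010", "c": "011", "d": "1000",
         "e": "1001", "f": "1010", "g": "1011", "h": "1111"}
DECODE = {bits: char for char, bits in CODES.items()}

def decode(bits):
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in DECODE:          # a complete code: emit it and reset
            out.append(DECODE[buf])
            buf = ""
    if buf:
        raise ValueError("leftover bits: " + buf)
    return "".join(out)

print(decode("010001001011"))      # -> baec
```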
I get the gist of this, but I don't understand (a) how it was figured out, (b) how you know it works, and (c) exactly what it means. Specifically, this line describes it:
So if we write down
010001001011
we can see that it’s 010 00 1001 011
....
I see that those are the codes, but I don't understand how you know not to read it as 0100 01 0010 11. I see that these values aren't actually codes in the table. However, I don't see how you would ever figure this out! I would like to know how to discover this. If I were trying to tinker with codes and bits like this, I would do this:
10 100 1000 101 1001
1010010001011001
10... or oops, nope, 101 also... Darn it. Well, maybe I can add a priority to the parsing of the code, so that 10 is higher priority than 101. That gets me to 10 100 1000 10 x... nope, that last 10 should be 101. Dang it.
So I would try adding different features like that priority function, or other things I can't think of at the moment, to see if it would help solve the problem.
I can't imagine how they figured out that the codes in the Huffman coding could be uniquely parsed (I still don't see how it's actually true; I would have to write out some examples to see it... that's part of the question: how to see it's true, and how to prove it). I'm wondering if someone could explain in more depth how it's proven to work, and how it was discovered (or how to discover something similar on your own).
Upvotes: 2
Views: 849
Reputation: 11537
A Huffman code works by laying the symbols out in a binary tree. In a binary tree, you can associate every leaf with a code by saying that a left child corresponds to a 0 bit and a right child to a 1. The path that leads from the root to a leaf then corresponds to a code unambiguously.
This works for any tree, and the prefix property follows from the fact that a leaf is terminal: you cannot reach a leaf (have a code) by passing through another leaf (by having another code as a prefix).
The basic idea of Huffman coding is that you can build the tree in such a way that the depth of every leaf reflects the probability of its symbol (symbols more likely to appear will be closer to the root, and so get shorter codes).
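That "a leaf is terminal" argument can be seen by decoding with the tree itself rather than a table. A sketch (the nested-dict representation is just one convenient choice): decoding walks the tree bit by bit, and hitting a leaf forces you to emit a symbol and restart at the root, so one code can never "continue into" another.

```python
def build_tree(codes):
    # Nested dicts for internal nodes; a leaf is the symbol itself.
    root = {}
    for char, bits in codes.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = char
    return root

TABLE = {"a": "00", "b": "010", "c": "011", "d": "1000",
         "e": "1001", "f": "1010", "g": "1011", "h": "1111"}

def decode_with_tree(bits):
    root = build_tree(TABLE)
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):   # reached a leaf: emit, back to root
            out.append(node)
            node = root
    return "".join(out)

print(decode_with_tree("010001001011"))  # -> baec
```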
There are several algorithms to build such a tree. For instance, assume you have a set of items you want to code, say a..f. You must know the probability of appearance of every item, thanks to either a model of the source or an analysis of the actual values (for instance by analysing the file to be coded).
Then you build the tree bottom-up by repeatedly merging the two least probable nodes. For instance, suppose the items have this set of probabilities:
a (0.5) b (0.2) c (0.1) d (0.05) e (0.05) f (0.1)
Then you pick items with the lowest probability (d and e), group them in a compound node (de) and get the new list
a (0.5) b (0.2) c (0.1) (de) (0.1) f (0.1)
And the successive item lists can be
a (0.5) b (0.2) (c(de)) (0.2) f (0.1)
a (0.5) b (0.2) ((c(de))f) (0.3)
a (0.5) (b((c(de))f)) (0.5)
(a(b((c(de))f))) (1.0)
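The successive merges above can be run mechanically with a priority queue. A minimal sketch using `heapq`, with the probabilities given above (a counter breaks ties so tuples never compare trees; the exact merge order on equal probabilities may differ from the steps shown, but any such tree yields an optimal prefix code):

```python
import heapq

probs = {"a": 0.5, "b": 0.2, "c": 0.1, "d": 0.05, "e": 0.05, "f": 0.1}

# Heap entries: (probability, tiebreak, tree); a tree is a symbol or a pair.
heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    p1, _, left = heapq.heappop(heap)    # two least probable nodes...
    p2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (p1 + p2, count, (left, right)))  # ...merged
    count += 1
tree = heap[0][2]

def codes(node, prefix=""):
    # A leaf's code is its root-to-leaf path: 0 = left, 1 = right.
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}

print(codes(tree))   # 'a' gets a 1-bit code; rarer symbols get longer ones
```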
So the prefix property is ensured by construction.
Upvotes: 5