Lance Pollard

Reputation: 79268

How Huffman coding figured out the property that the codes are unique

I have just read this:

This is where a really smart idea called Huffman coding comes in! The idea is that we represent our characters (like a, b, c, d, ….) with codes like

a: 00
b: 010
c: 011
d: 1000
e: 1001
f: 1010
g: 1011
h: 1111

If you look at these carefully, you’ll notice something special! It’s that none of these codes is a prefix of any other code. So if we write down 010001001011 we can see that it’s 010 00 1001 011 or baec! There wasn’t any ambiguity, because 0 and 01 and 0100 don’t mean anything.

I get the gist of this, but I don't understand (a) how it was figured out, (b) how you know it works, and (c) exactly what it means. Specifically this line describes it:

So if we write down 010001001011 we can see that it’s 010 00 1001 011....

I see that those are the codes, but I don't understand how you know not to read it as 0100 01 0010 11. I see that these values aren't actually codes in the table. However, I don't see how you would ever figure this out! I would like to know how to discover this. If I were trying to tinker with codes and bits like this, I would do this:

  1. Come up with a set of codes, like 10 100 1000 101 1001
  2. Try writing out some examples of the codes. So maybe an example is just concatenating the codes in order above: 1010010001011001.
  3. See if I can parse the codes. So 10 or oops, nope 101 also... Darnit, well maybe I can add a priority to the parsing of the code, and so 10 is higher priority than 101. That gets me to 10 100 1000 10 x nope that last 10 should be 101. Dangit.

So I would try adding different features like that priority function, or other things I can't think of at the moment, to see if it would help solve the problem.
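For instance, here is a small sketch (in Python; the helper names are just mine, not from anywhere) of the two checks I have in mind: greedy decoding with the table from the quote, and a test of whether any code in a set is a prefix of another. My set 10 100 1000 101 1001 fails that test, which seems to be exactly where my parse got stuck.

```python
codes = {"00": "a", "010": "b", "011": "c", "1000": "d",
         "1001": "e", "1010": "f", "1011": "g", "1111": "h"}

def decode(bits):
    """Read left to right; emit a symbol as soon as the buffer matches a code.
    No backtracking is ever needed, because no code is a prefix of another."""
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in codes:
            out.append(codes[buffer])
            buffer = ""
    if buffer:
        raise ValueError("leftover bits: " + buffer)
    return "".join(out)

def prefix_violation(words):
    """Return a pair (a, b) where a is a prefix of b, or None if none exists."""
    for a in words:
        for b in words:
            if a != b and b.startswith(a):
                return (a, b)
    return None

print(decode("010001001011"))                                  # baec
print(prefix_violation(["10", "100", "1000", "101", "1001"]))  # ('10', '100')
print(prefix_violation(list(codes)))                           # None
```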

I can't imagine how they would figure out that the codes in Huffman coding can be uniquely parsed (I still don't see how it's actually true; I would have to write out some examples to see it, or... that's part of the question, how to see that it's true, how to even prove it). I'm wondering if someone could explain in more depth how it's proven to work, and how it was discovered (or how to discover something similar to it on your own).

Upvotes: 2

Views: 849

Answers (1)

Alain Merigot

Reputation: 11537

Huffman coding works by laying the symbols out in a tree. Given a binary tree, you can associate every leaf with a code by saying that going to a left child corresponds to a 0 bit and going to a right child to a 1 bit. The path that leads from the root to a leaf then corresponds to a code in an unambiguous way.

[figure: a binary tree whose leaves hold the coded symbols, with each left branch labelled 0 and each right branch labelled 1]

This works for any binary tree, and the prefix property comes from the fact that a leaf is terminal. Hence, you cannot reach a leaf (have a code) by passing through another leaf (by having another code be a prefix).
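As a minimal sketch (not from the original answer; it reuses the code table from the question, since the figure is not reproduced here), decoding is just a walk down the tree that restarts at the root every time a leaf is reached:

```python
def build_tree(codes):
    """Build a nested-dict binary tree from {code: symbol}; leaves are strings."""
    root = {}
    for word, symbol in codes.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})   # inner node along the path
        node[word[-1]] = symbol               # last bit lands on the leaf
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]                      # 0 = left child, 1 = right child
        if isinstance(node, str):             # a leaf is terminal:
            out.append(node)                  # emit its symbol...
            node = root                       # ...and restart at the root
    return "".join(out)

codes = {"00": "a", "010": "b", "011": "c", "1000": "d",
         "1001": "e", "1010": "f", "1011": "g", "1111": "h"}
print(decode("010001001011", build_tree(codes)))   # baec
```

Because a leaf has no children, the walk can never continue past one code into another, which is exactly the prefix property.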

The basic idea of Huffman coding is that you can build the tree in such a way that the depth of every leaf reflects the probability of its symbol (symbols more likely to appear will be closer to the root and therefore get shorter codes).

There are several algorithms to build such a tree. For instance, assume you have a set of items you want to code, say a..f. You must know the probability of appearance of every item, either from a model of the source or from an analysis of the actual values (for instance by scanning the file to be encoded).

Then you can:

  1. sort the items by probability
  2. pick the two items with the lowest probability
  3. remove these two items and group them into a new compound node, assigning one of them to the left child (code 0) and the other to the right child (code 1)
  4. give the compound node a probability equal to the sum of the two individual probabilities and insert it back into the sorted item list
  5. go back to step 2 while more than one item remains

For the previous tree, it may correspond to a set of probabilities

a (0.5) b (0.2) c (0.1) d (0.05) e (0.05) f (0.1)

Then you pick the two items with the lowest probability (d and e), group them into a compound node (de), and get the new list

a (0.5) b (0.2) c (0.1) (de) (0.1) f (0.1)

And the successive item lists can be

a (0.5) b (0.2) c(de) (0.2) f (0.1)

a (0.5) b (0.2) (c(de))f (0.3)

a (0.5) b((c(de))f) (0.5)

a(b((c(de))f)) (1.0)

So the prefix property is ensured by construction.
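As an illustration, here is a sketch of the construction above (not part of the original answer; it uses a heap and breaks the 0.1 ties arbitrarily, so the exact codes may differ from the figure, but the code lengths and the prefix property come out the same):

```python
import heapq

def huffman_codes(freqs):
    # Heap entries are (probability, tie_breaker, subtree); a subtree is
    # either a symbol (a leaf) or a (left, right) pair (a compound node).
    heap = [(p, i, sym) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:                      # step 5: repeat while >1 item
        p1, _, left = heapq.heappop(heap)     # step 2: two lowest probabilities
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))  # steps 3-4
        counter += 1
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):           # compound node: recurse
            walk(node[0], prefix + "0")       # left child gets a 0
            walk(node[1], prefix + "1")       # right child gets a 1
        else:
            codes[node] = prefix              # leaf: the path is its code
    walk(root, "")
    return codes

freqs = {"a": 0.5, "b": 0.2, "c": 0.1, "d": 0.05, "e": 0.05, "f": 0.1}
for sym, code in sorted(huffman_codes(freqs).items()):
    print(sym, code)
```

Every symbol ends up on a leaf of a single tree, so no code can be a prefix of another.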

Upvotes: 5
