dharam

Reputation: 8106

Identifying the positions of characters in Huffman Coding Algorithm

I am reading Huffman Coding Algorithm to encode a string. I can see that the frequency of the characters is taken into account to make a tree.

Here is the frequency table :

a   b   d   e   f   h   i   k   n   o   r   s   t   u   v   _
5   1   3   7   3   1   1   1   4   1   5   1   2   1   1   9

*`_` denotes the space character

I can see there is a tree made from this, but I am not able to derive a rule for how to place the elements in the tree.

The book says that the characters with higher frequencies should be near the root, and that if more than two characters have the same frequency, they have to be placed on different sides of the root.

The question is, how do we decide the position?

In my book, a has the code 010, r has 011, and e has 100.

Can anyone please help?

Upvotes: 0

Views: 2396

Answers (2)

Mark Adler

Reputation: 112414

Once you have your tree, the choice of which branch gets the 0 and which gets the 1 at each fork in the road is arbitrary. So without a way to make that assignment canonical, there is no "right answer" for the bits assigned to each symbol, e.g. that r must be 011. r could be any three-bit value. (Though it must be three bits long for this set of frequencies.)

All that matters is that the decoder uses the same assignment of 0s and 1s as the encoder. Either you can send the codes directly, or you can send only the code lengths and assign the 0s and 1s in a canonical manner. As an example, the compression algorithm used in zip, gzip, png, etc. sends only the number of bits for each symbol. Then, starting with the smallest length, all symbols of that length are assigned codes counting up from 0, in order of the symbols' integer representation (e.g. ASCII order for characters). When moving to the next length, a bit is appended on the right and the counting continues. This assures a proper prefix code, decoded from left to right.

So in this case, the code lengths are:

2: _
3: a, e, r
4: d, f, n
5: b, h, t
6: i, k, o, s, u, v

So we get (with symbols in alphabetic order within each length):

_: 00
a: 010
e: 011
r: 100
d: 1010
f: 1011
n: 1100
b: 11010
h: 11011
t: 11100
i: 111010
k: 111011
o: 111100
s: 111101
u: 111110
v: 111111

The bit assignments here are different from what's in your book for two of the three symbols. As examples of other perfectly good canonical prefix codes, you could invert all of the bits, or invert any subset of the columns of bits, e.g. the whole first column. You could change the order of symbols within each length. You could reverse the bit order. In fact, zip, etc. store the bits shown above in reverse order, so decoding is done starting from the least significant bit, i.e. right to left.
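The canonical assignment described above can be sketched in a few lines of Python. This is a minimal illustration (the function and variable names are my own, not from any particular library): sort the symbols by (code length, symbol), assign consecutive integers as codes, and left-shift the counter whenever the length increases.

```python
# Code lengths from the answer above: symbol -> number of bits.
lengths = {
    '_': 2,
    'a': 3, 'e': 3, 'r': 3,
    'd': 4, 'f': 4, 'n': 4,
    'b': 5, 'h': 5, 't': 5,
    'i': 6, 'k': 6, 'o': 6, 's': 6, 'u': 6, 'v': 6,
}

def canonical_codes(lengths):
    codes = {}
    code = 0
    prev_len = 0
    # Sort by (length, symbol) so shorter codes come first and
    # symbols of equal length are in sorted order.
    for sym, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= n - prev_len          # lengthen the code on the right
        codes[sym] = format(code, '0{}b'.format(n))
        code += 1                      # next code of the same length
        prev_len = n
    return codes

codes = canonical_codes(lengths)
print(codes['a'], codes['r'])  # 010 100
```

Running this reproduces the table above exactly, including `_: 00` and `v: 111111`.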

Upvotes: 2

Sufian Latif

Reputation: 13356

Have you tried Wikipedia? There's a nice demonstration of Huffman coding there. The algorithm is simple enough: you need a priority queue.

The algorithm is somewhat like this:

1. Create tree nodes with each character and their frequencies
2. Put all the letters and their frequencies in a priority queue Q
3. Do until Q contains only one element:
    3a. Pick two lowest-frequency items a, b
    3b. Create a tree node z with frequency(z) = frequency(a) + frequency(b)
    3c. Add a and b as left and right children of z
    3d. Put z in Q
4. Pick up the only element from Q. This would be the root of the tree.
5. Assign binary codes to each leaf node according to their root-to-leaf path.

The priority queue should be a min-priority queue, i.e. the node with the lowest frequency comes out first. For equal-frequency items, use some other criterion (e.g. alphabetical order) as a tie-breaker, and apply the same tie-breaking rule in both the encoder and decoder.
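The steps above can be sketched with Python's `heapq` as the min-priority queue; this is only an illustration (the names are my own), using an insertion counter as a deterministic tie-breaker. Note that a different tie-breaking rule can produce a different, but equally optimal, set of code lengths.

```python
import heapq

def huffman_code_lengths(freqs):
    # Each heap entry: (frequency, tie_breaker, tree).
    # A tree is either a symbol (leaf) or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)   # two lowest-frequency items
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, counter, (a, b)))
        counter += 1
    # Walk root-to-leaf paths; the depth of each leaf is its code length.
    depths = {}
    def walk(node, depth):
        if isinstance(node, tuple):
            walk(node[0], depth + 1)
            walk(node[1], depth + 1)
        else:
            depths[node] = depth
    walk(heap[0][2], 0)
    return depths

# Frequencies from the question ('_' is the space character).
freqs = {'a': 5, 'b': 1, 'd': 3, 'e': 7, 'f': 3, 'h': 1, 'i': 1,
         'k': 1, 'n': 4, 'o': 1, 'r': 5, 's': 1, 't': 2, 'u': 1,
         'v': 1, '_': 9}
depths = huffman_code_lengths(freqs)
print(depths['_'], depths['e'])  # 2 3
```

The total encoded length, `sum(freqs[s] * depths[s])`, comes out the same for any valid tie-breaking, which is what optimality of Huffman coding guarantees.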

Upvotes: 4
