WDRKKS

Reputation: 135

Decoding Huffman file from canonical form

I am writing a Huffman-encoded file and storing the code lengths of the canonical codes in its header. During decoding, I can regenerate the canonical codes and store them in a std::map<std::uint8_t, std::vector<bool>>. The actual data is read into a single std::vector<bool>. Before anyone suggests std::bitset: Huffman codes have variable bit lengths, which is why I am using std::vector<bool>. So, given my symbols and their corresponding canonical codes, how do I decode the file? I don't know where to go from here, and I couldn't find anything useful on this when searching.
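For illustration, here is a minimal sketch of that regeneration step (the convention used here, assigning codes sequentially from the shortest length and storing bits most-significant-bit first, is just one common choice, and the names are illustrative):

#include <cstdint>
#include <map>
#include <vector>

// Rebuild canonical codes from the code lengths stored in the header.
// lengths[sym] is the code length for symbol sym; 0 means the symbol is unused.
std::map<std::uint8_t, std::vector<bool>>
rebuild_codes(const std::vector<int>& lengths, int maxbits)
{
    std::map<std::uint8_t, std::vector<bool>> codes;
    unsigned next_code = 0;
    for (int len = 1; len <= maxbits; len++) {
        next_code <<= 1;                           // codes of this length start here
        for (std::size_t sym = 0; sym < lengths.size(); sym++) {
            if (lengths[sym] != len)
                continue;
            std::vector<bool> bits(len);
            for (int b = 0; b < len; b++)          // most significant bit first
                bits[b] = ((next_code >> (len - 1 - b)) & 1u) != 0;
            codes[static_cast<std::uint8_t>(sym)] = bits;
            next_code++;                           // codes within a length are sequential
        }
    }
    return codes;
}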

Upvotes: 5

Views: 1952

Answers (2)

Mark Adler

Reputation: 112627

You do not need to create the codes or the tree in order to decode canonical codes. All you need is the list of symbols in order and the count of symbols in each code length. By "in order", I mean sorted by code length from shortest to longest, and within each code length, sorted by the symbol value.
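For example, those two pieces of information can be built from the per-symbol code lengths in the file header like this (a C++ sketch with illustrative names, not taken from puff.c):

#include <cstdint>
#include <vector>

struct CanonicalTable {
    std::vector<int> count;            // count[len] = number of symbols coded with len bits
    std::vector<std::uint8_t> symbol;  // symbols sorted by code length, then by value
};

// lengths[sym] is the code length of symbol sym from the header, 0 if unused.
CanonicalTable build_table(const std::vector<int>& lengths, int maxbits)
{
    CanonicalTable t;
    t.count.assign(maxbits + 1, 0);
    for (std::size_t sym = 0; sym < lengths.size(); sym++)
        if (lengths[sym] > 0)
            t.count[lengths[sym]]++;               // tally symbols per code length

    // Visiting lengths from short to long, and symbols in increasing value,
    // produces exactly the canonical order described above.
    for (int len = 1; len <= maxbits; len++)
        for (std::size_t sym = 0; sym < lengths.size(); sym++)
            if (lengths[sym] == len)
                t.symbol.push_back(static_cast<std::uint8_t>(sym));
    return t;
}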

Since the canonical codes within a code length are sequential binary integers, you can simply do integer comparisons to see whether the bits you have so far fall within that length's code range, and if so, an integer subtraction to determine which symbol it is.

Below is code from puff.c (with minor changes) to show explicitly how this is done. bits(s, 1) returns the next bit from the stream. (This assumes that there is always a next bit.) h->count[len] is the number of symbols that are coded by length len codes, where len is in 0..MAXBITS. If you add up h->count[1], h->count[2], ..., h->count[MAXBITS], that is the total number of symbols coded, and is the length of the h->symbol[] array. The first h->count[1] symbols in h->symbol[] have length 1. The next h->count[2] symbols in h->symbol[] have length 2. And so on.

The values in the h->count[] array, if correct, are constrained to not oversubscribe the possible number of codes that can be coded in len bits. They can be further constrained to represent a complete code, i.e. one with no bit sequence left undefined, in which case decode() can never return an error (-1). For a code to be complete and not oversubscribed, the sum of h->count[len] << (MAXBITS - len) over all len must equal 1 << MAXBITS.
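That check can be written directly from the counts (a small sketch, not part of puff.c):

// Returns true if count[1..maxbits] describes a code that is complete and
// not oversubscribed: the sum of count[len] << (maxbits - len) over all
// lengths must be exactly 1 << maxbits.
bool code_is_complete(const short *count, int maxbits)
{
    long long sum = 0;
    for (int len = 1; len <= maxbits; len++)
        sum += (long long)count[len] << (maxbits - len);
    return sum == (1LL << maxbits);   // greater: oversubscribed, smaller: incomplete
}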

Simple example: if we are coding e with one bit, t with two bits, and a and o with three bits, then h->count[] is {0, 1, 1, 2} (the first value, h->count[0], is not used), and h->symbol[] is {'e','t','a','o'}. Then the code for e is 0, the code for t is 10, a is 110, and o is 111.

#define MAXBITS 15              /* maximum bits in a code */

struct huffman {
    short *count;       /* number of symbols of each length */
    short *symbol;      /* canonically ordered symbols */
};

int decode(struct state *s, const struct huffman *h)
{
    int len;            /* current number of bits in code */
    int code;           /* len bits being decoded */
    int first;          /* first code of length len */
    int count;          /* number of codes of length len */
    int index;          /* index of first code of length len in symbol table */

    code = first = index = 0;
    for (len = 1; len <= MAXBITS; len++) {
        code |= bits(s, 1);             /* get next bit */
        count = h->count[len];
        if (code - count < first)       /* if length len, return symbol */
            return h->symbol[index + (code - first)];
        index += count;                 /* else update for next length */
        first += count;
        first <<= 1;
        code <<= 1;
    }
    return -1;                          /* ran out of codes */
}
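As a worked example tying this back to the question, here is a self-contained C++ adaptation of the same loop (my own naming, not part of puff.c) that reads bits from a std::vector<bool> and decodes the e/t/a/o code above:

#include <cstdint>
#include <iostream>
#include <vector>

constexpr int MAXBITS = 15;             /* maximum bits in a code */

// Decode one symbol starting at position pos in the bit vector.
// Returns the symbol, or -1 if the bits do not form a valid code.
int decode_one(const std::vector<bool>& data, std::size_t& pos,
               const std::vector<int>& count,
               const std::vector<std::uint8_t>& symbol)
{
    int code = 0, first = 0, index = 0;
    for (int len = 1; len <= MAXBITS && pos < data.size(); len++) {
        code |= data[pos++] ? 1 : 0;    /* get next bit */
        int n = count[len];
        if (code - n < first)           /* code falls in this length's range */
            return symbol[index + (code - first)];
        index += n;                     /* else skip this length's symbols */
        first = (first + n) << 1;
        code <<= 1;
    }
    return -1;                          /* ran out of bits or codes */
}

int main()
{
    std::vector<int> count = {0, 1, 1, 2};            // lengths 0..3 from the example
    count.resize(MAXBITS + 1, 0);                     // pad unused lengths
    std::vector<std::uint8_t> symbol = {'e', 't', 'a', 'o'};

    // The bits for "etao" with the canonical codes 0, 10, 110, 111.
    std::vector<bool> data = {false, true, false, true, true, false, true, true, true};

    std::size_t pos = 0;
    while (pos < data.size())
        std::cout << static_cast<char>(decode_one(data, pos, count, symbol));
    std::cout << '\n';                                // prints "etao"
}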

Upvotes: 12

muued

Reputation: 1676

Your map contains the relevant information, but it maps symbols to codes, whereas the data you are trying to decode consists of codes. Your map therefore cannot be used to look up the symbol for a code efficiently, since its lookup expects a symbol as the key; searching the map for a code and retrieving the corresponding symbol would be a linear search.

Instead you should reconstruct the Huffman tree you built for the compression step. The frequency values of the inner nodes are irrelevant here, but you need the leaf nodes at the correct positions. You can create the tree on the fly as you read your file header: start with an empty tree, and for each symbol-to-code mapping you read, create the corresponding nodes in the tree. For example, if the symbol 'D' has been mapped to the code 101, make sure there is a right child node at the root, which has a left child node, which has a right child node containing the symbol 'D', creating any nodes that are missing.
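A rough C++ sketch of that on-the-fly construction (Node and insert_code are illustrative names, not from the question):

#include <cstdint>
#include <memory>
#include <vector>

struct Node {
    std::unique_ptr<Node> left;   // followed on a 0 bit
    std::unique_ptr<Node> right;  // followed on a 1 bit
    int symbol = -1;              // -1 marks an inner node
};

// Insert one symbol/code pair read from the header, creating missing nodes.
void insert_code(Node& root, std::uint8_t symbol, const std::vector<bool>& code)
{
    Node* n = &root;
    for (bool bit : code) {
        std::unique_ptr<Node>& child = bit ? n->right : n->left;
        if (!child)
            child = std::make_unique<Node>();
        n = child.get();
    }
    n->symbol = symbol;           // the leaf at the end of the code holds the symbol
}

Calling insert_code once per entry in your map reproduces the tree that the traversal below walks.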

Using that tree, you can then decode the stream as follows (pseudo-code, assuming that taking the right child corresponds to appending a 1 bit to the code):

// use a node variable to remember the position in the tree while reading bits
node n = tree.root
while(stream not fully read) {
    read next bit into boolean b
    if (b == true) {
        n = n.rightChild
    } else {
        n = n.leftChild
    }
    // check whether we are in a leaf node now
    if (n.leftChild == null && n.rightChild == null) {
        // n is a leaf node, thus we have read a complete code
        // add the corresponding symbol to the decoded output
        decoded.add(n.getSymbol())
        // reset the search
        n = tree.root
    }
}

Note that inverting your map so that the lookup goes in the code-to-symbol direction will still give suboptimal performance compared to the tree traversal, because a flat lookup cannot narrow the search space bit by bit the way the traversal does.

Upvotes: 1
