Lynct

Reputation: 211

Text compression: strategy for coding the prefix (header) to improve performance

I am currently working on a Huffman text compression exercise, but I've encountered some issues with the coding of my header. I use my character-frequency table to save the information I need as the header for decompressing the file (all converted into binary strings, then saved in byte arrays).

So at first, I used 2 bytes for each character: 1 byte for the character and 1 byte for its frequency. However, I realized this will not work for a large text, where the frequency of some characters could surpass 255 (1 byte).

Therefore I made a modification: I adjust the number of reserved bytes for each character depending on its frequency. It looks something like this:

public String freString(int freq, String s){
    // emit a full byte of ones to signal "more bytes follow"
    freq = freq - 255;
    s = s + "11111111";

    if(freq >= 255){
        s = freString(freq, s);
    }else{
        // last byte: the remaining frequency as an 8-bit binary string
        String remainFreq = Integer.toBinaryString(freq);
        // pad the omitted leading zeros of the last byte
        remainFreq = String.format("%8s", remainFreq).replace(' ', '0');
        s = s + remainFreq;
    }

    return s;
}

With this, during decompression I look at the value of the next byte; if it is 255, I keep adding the value of the following byte, and so on.

Example of my header:

[9, 141, 3, 142, 255, 33, 143, 255, 255, 2]

[length of my header = 9, a = 3, b = 288, c = 512]
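For clarity, the decompression side reads a frequency back roughly like this (a simplified sketch, not my full decoder; readFrequency is just an illustrative name):

// Simplified sketch of reading one frequency back from the header,
// starting at header[pos]; 255 means "add the next byte as well".
static int readFrequency(byte[] header, int pos) {
    int freq = 0;
    int b;
    do {
        b = header[pos++] & 0xFF;  // unsigned byte value
        freq += b;
    } while (b == 255);
    return freq;                   // a real decoder would also report how many bytes were consumed
}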

It works fine, but it greatly reduces my compression ratio as the text gets larger and larger. For example, if 'a' repeats 5000 times, I end up using up to 20 bytes to store my string of frequency values instead of 2 bytes (00010011 10001000 = 5000).

So here is my question... is there a better strategy I can use to increase the reserved bytes of a character dynamically and at the same time indicate the "end of freq string"? I've thought of reserving a minimum of 3 bytes per character (1 for the char, 1 for its freq, and 1 to indicate the end of the freq string), but this would hurt the compression ratio of smaller text files. Is this the trade-off I must take, or is there a better way of doing it?

Upvotes: 1

Views: 370

Answers (1)

user555045

Reputation: 64913

If you have a Huffman tree, then you can make many other Huffman trees that assign the same lengths to the symbols but different code words, by swapping the left and right children of any node. All those trees are equally good - they compress the data just as much, because the lengths stay the same. Canonical Huffman is a way to agree beforehand on how to choose one specific tree out of all those possible permuted trees, so that you don't have to communicate which one of those trees you're actually using.

What that means in practice is that the tree can be reconstructed from just the lengths. It's not necessary to actually reconstruct the tree, but the ability to reconstruct it means you have retained all information.

As is often the case, there are different choices you could make for which tree is the canonical one. The choice you make has some implications for decoding techniques, but that may be beyond the scope of this question. Anyway, one choice is to permute the tree such that

  1. the longest code is all zeroes
  2. shorter codes (padded with zeroes on the right to the same length as the longest code) are numerically greater than all longer codes
  3. symbols with the same length are assigned codes in ascending order

Rules 1 and 2 make a tree that is deepest on one side, shallowest on the other, with no weird jumps in between. Rule 3 orders nodes that are at the same depth.

Turns out though, you don't need to do any tree restructuring. Using just the length assigned to every symbol, the codes can be constructed easily, like this:

// create histogram of lengths
const int maxcodelength = 15; // 15 is the max code length for Deflate
uint[] length_count = new uint[maxcodelength + 1];
for (int i = 0; i < symbollengths.Length; i++)
    length_count[symbollengths[i]]++;

// find starting point (lowest code) for each length
uint code = 0;
uint[] next_code = new uint[maxcodelength + 1];
next_code[maxcodelength] = 0;
for (int bits = maxcodelength - 1; bits >= 0; bits--)
{
    code = (code + length_count[bits + 1]) >> 1;
    next_code[bits] = code;
}

// assign codes to symbols
uint[] codes = new uint[256];
for (int n = 0; n < codes.Length; n++)
{
    int len = symbollengths[n];
    if (len != 0)
    {
        codes[n] = next_code[len];
        next_code[len]++;
    }
}

This is strongly related to the code on page 8 of RFC 1951 (Deflate), but different: the shift goes the other way, resulting in the all-zero code having the longest length, whereas in Deflate the all-zero code has the shortest length.
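For comparison, here is a rough sketch of the RFC 1951 style assignment in Java (so the all-zero code ends up on the shortest length instead); the names are just illustrative:

// Sketch of RFC 1951's canonical code assignment, in Java.
// symbolLengths[i] is the code length of symbol i, 0 meaning the symbol is unused.
static int[] assignDeflateStyleCodes(int[] symbolLengths, int maxBits) {
    int[] blCount = new int[maxBits + 1];          // histogram of code lengths
    for (int len : symbolLengths)
        if (len != 0) blCount[len]++;

    int[] nextCode = new int[maxBits + 1];         // lowest code for each length
    int code = 0;
    for (int bits = 1; bits <= maxBits; bits++) {
        code = (code + blCount[bits - 1]) << 1;    // shift goes the other way than above
        nextCode[bits] = code;
    }

    int[] codes = new int[symbolLengths.length];
    for (int n = 0; n < symbolLengths.length; n++) {
        int len = symbolLengths[n];
        if (len != 0) codes[n] = nextCode[len]++;
    }
    return codes;
}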

As for the header, now you need only 4 bits per symbol (if you also use the length limit of 15), certainly not more than 8 bits per symbol (codes longer than 256 would be kind of crazy). That would still be 128 or 256 bytes for the header (for an alphabet of 256). You could improve that, for example by borrowing Deflate's scheme of run-length encoding the lengths.
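As an illustration (one possible layout, not the only one), the lengths for a 256-symbol alphabet could be packed two per byte like this in Java:

// Sketch: pack 4-bit code lengths two per byte (256 symbols -> 128 header bytes).
// Assumes every length fits in 4 bits, i.e. a length limit of 15.
static byte[] packLengths(int[] symbolLengths) {
    byte[] header = new byte[symbolLengths.length / 2];
    for (int i = 0; i < symbolLengths.length; i += 2) {
        int hi = symbolLengths[i] & 0xF;
        int lo = symbolLengths[i + 1] & 0xF;
        header[i / 2] = (byte) ((hi << 4) | lo);
    }
    return header;
}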


Additional stuff.

One way to guarantee the maximum length is not exceeded is to divide all frequencies by 2 (rounding up) and recreate the Huffman tree, repeating until the maximum length is no longer exceeded. There are also other ways to calculate a valid set of lengths without building a tree, for example package-merge.
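A rough sketch of that halving loop in Java (buildLengthsFromFrequencies is a placeholder for whatever Huffman construction you already have; it should return the code length assigned to each symbol):

// Sketch: halve the frequencies (rounding up) and rebuild until the limit holds.
static int[] limitCodeLengths(int[] freq, int maxCodeLength) {
    int[] lengths = buildLengthsFromFrequencies(freq);   // placeholder, see above
    while (maxOf(lengths) > maxCodeLength) {
        for (int i = 0; i < freq.length; i++)
            if (freq[i] > 0) freq[i] = (freq[i] + 1) / 2; // divide by 2, rounding up
        lengths = buildLengthsFromFrequencies(freq);
    }
    return lengths;
}

static int maxOf(int[] a) {
    int m = 0;
    for (int v : a) m = Math.max(m, v);
    return m;
}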

The length is limited in almost all compression formats that use Huffman coding. It's important for both encoding and decoding, but mainly for decoding. For encoding, having codes no longer than 25 bits means you can use a 32-bit buffer and write out whole bytes (so that up to 7 bits can be left in the buffer) without needing a special case for when adding a code to the buffer would overflow it. For decoding, a short(ish) maximum code length enables a simple single-table lookup: the table is indexed with maxcodelength bits (the "window") at a time, giving both the first symbol in the window (the actual decoding) and the length of that symbol (so it can be shifted out). If the maximum code length is longer, slightly more sophisticated techniques are required, such as multi-level tables or, my personal favourite, different tables depending on the number of leading zeroes in the window.
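To make the single-table idea concrete, here is a rough Java sketch of building such a table from canonical codes (packing the symbol and length into one int is just one possible layout):

// Sketch: single-table decoding. The table is indexed with maxCodeLength bits of
// the input window; each entry packs the code length (high bits) and the decoded
// symbol (low bits), so the decoder can emit the symbol and shift out its length.
static int[] buildDecodeTable(int[] codes, int[] symbolLengths, int maxCodeLength) {
    int[] table = new int[1 << maxCodeLength];
    for (int sym = 0; sym < codes.length; sym++) {
        int len = symbolLengths[sym];
        if (len == 0) continue;                          // unused symbol
        int first = codes[sym] << (maxCodeLength - len); // left-align the code in the window
        int count = 1 << (maxCodeLength - len);          // every window that starts with this code
        for (int i = 0; i < count; i++)
            table[first + i] = (len << 16) | sym;
    }
    return table;
}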

Upvotes: 0
