Viet Long Le Nguyen

Reputation: 13

Reading bit by bit for Huffman Compression

I'm writing a Python program that implements Huffman compression. However, it seems that I can only read/write a binary file byte by byte instead of bit by bit. Is there any workaround for this problem? Wouldn't processing byte by byte defeat the purpose of compression, since extraneous padding would be needed? Also, it'd be great if someone could enlighten me about how Huffman compression is applied in practice given this byte-by-byte limitation.

Upvotes: 1

Views: 639

Answers (2)

user555045

Reputation: 64904

A potential way to only have to read whole bytes is to buffer bits directly in the decoding routine. This combines well with table-based decoding and avoids the overhead of ever doing bit-by-bit IO (hiding that overhead behind layers of abstraction doesn't make it go away, it just sweeps it under the carpet).

In the simplest case, table-based decoding needs a "window" of the bit stream that is as large as¹ the largest possible code (incidentally, this sort of thing is a large part of the reason why many formats that use Huffman compression specify a maximum code length that isn't super long²), which can be created by shifting the buffer to the right until it has the correct size:

window = buffer >> (bitsInBuffer - maxCodeLen)
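
For example (with illustrative numbers, not from the answer): if 7 bits are currently buffered and codes are at most 4 bits long, the window is the top 4 of those 7 bits:

buffer = 0b1011010
bitsInBuffer = 7
maxCodeLen = 4
window = buffer >> (bitsInBuffer - maxCodeLen)   # 0b1011, the top 4 buffered bits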

Since this gets rid of excess bits anyway, it is safe to append more bits than strictly necessary to the buffer when there are not enough:

while bitsInBuffer < maxCodeLen:
    buffer = (buffer << 8) | readByte()
    bitsInBuffer += 8

Thus byte-level I/O is sufficient. Actually, you could read slightly bigger blocks (e.g. two bytes at a time) if you wanted. By the way, there is a slight complication here: if all the bytes of the file have been read and the buffer still does not have enough bits in it (a legitimate condition that can happen for valid bitstreams), you just have to fill with "padding" (basically shift left without ORing in new bits).
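
A minimal sketch of such a refill step, assuming the compressed data sits in a Python bytes object (the names refill, data and pos are illustrative, not from the answer):

def refill(buffer, bitsInBuffer, data, pos, maxCodeLen):
    # Append whole bytes until at least maxCodeLen bits are buffered;
    # once the input is exhausted, shift in zero "padding" bits instead.
    while bitsInBuffer < maxCodeLen:
        if pos < len(data):
            buffer = (buffer << 8) | data[pos]
            pos += 1
        else:
            buffer <<= 8
        bitsInBuffer += 8
    return buffer, bitsInBuffer, pos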

Decoding itself could look like this:

# this line does the actual decoding
(symbol, length) = table[window]
# remove that code from the buffer
bitsInBuffer -= length
buffer = buffer & ((1 << bitsInBuffer) - 1)
# use decoded symbol
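
Tying the pieces together, a complete decode loop might look roughly like this (a sketch only, reusing the hypothetical refill helper from above; symbolCount stands for however you track how many symbols to decode, e.g. a stored original length, so the trailing padding bits are never interpreted as data):

def decode(data, table, maxCodeLen, symbolCount):
    out = []
    buffer, bitsInBuffer, pos = 0, 0, 0
    for _ in range(symbolCount):
        buffer, bitsInBuffer, pos = refill(buffer, bitsInBuffer, data, pos, maxCodeLen)
        window = buffer >> (bitsInBuffer - maxCodeLen)   # top maxCodeLen bits
        symbol, length = table[window]                   # table-based decode
        bitsInBuffer -= length                           # remove that code
        buffer &= (1 << bitsInBuffer) - 1
        out.append(symbol)
    return out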

This is all very easy; the hard part is constructing the table. One way to do it (not a great way, but a simple way) is to take every integer from 0 up to and including (1 << maxCodeLen) - 1 and decode the first symbol in it using bit-by-bit tree-walking, the way you're used to. A faster way is to take every symbol/code pair and use it to fill in the right entries of the table:

# for each symbol/code do this:
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(0, 1 << bottomSize):
    table[topBits | bottom] = (symbol, codeLen)
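
As a concrete (made-up) illustration: with maxCodeLen = 3, a symbol whose code is 0 with length 1 owns every table index whose top bit is 0:

maxCodeLen = 3
table = [None] * (1 << maxCodeLen)
code, codeLen = 0b0, 1              # hypothetical symbol 'a' with a 1-bit code
bottomSize = maxCodeLen - codeLen
topBits = code << bottomSize
for bottom in range(0, 1 << bottomSize):
    table[topBits | bottom] = ('a', codeLen)
# table[0b000] .. table[0b011] now all map to ('a', 1)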

By the way, none of this code has been tested; it's just to show roughly how it might be done. It also assumes a particular way of packing the bitstream into bytes, with the first bit in the top of the byte.
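
For completeness, the matching encoder side could pack codes with the same MSB-first convention. Again just a sketch (pack_codes and the codes mapping are hypothetical names, not from the answer):

def pack_codes(symbols, codes):
    # codes maps symbol -> (code, codeLen); bits go into the top of each byte
    out = bytearray()
    acc, accBits = 0, 0
    for s in symbols:
        code, codeLen = codes[s]
        acc = (acc << codeLen) | code
        accBits += codeLen
        while accBits >= 8:
            accBits -= 8
            out.append((acc >> accBits) & 0xFF)
            acc &= (1 << accBits) - 1                 # drop the bits just written
    if accBits:
        out.append((acc << (8 - accBits)) & 0xFF)     # zero-pad the last byte
    return bytes(out)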


¹: Some multi-stage decoding strategies are able to use a smaller window, which may be required if there is no bound on the code length.

²: e.g. 15 bits max for Deflate

Upvotes: 2

Cmaster

Reputation: 72

Layer your code. Have a bottom I/O layer that does all the file access, reading and writing either the entire file at once or with buffering. Have a layer above that which processes the Huffman code bitstream bit by bit.
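
A minimal sketch of that layering, with illustrative class names: the bottom layer deals only in whole bytes, and the Huffman layer above only ever asks for or emits single bits (bits are packed MSB-first here, which is an assumption, not something this answer specifies):

class BitReader:
    # bottom layer: the whole file is read into memory once, then served out bit by bit
    def __init__(self, data):
        self.data = data
        self.pos = 0                     # bit position within data

    def read_bit(self):
        byte = self.data[self.pos >> 3]
        bit = (byte >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return bit

class BitWriter:
    # bottom layer for output: collects bits and zero-pads the final byte
    def __init__(self):
        self.out = bytearray()
        self.acc, self.accBits = 0, 0

    def write_bit(self, bit):
        self.acc = (self.acc << 1) | (bit & 1)
        self.accBits += 1
        if self.accBits == 8:
            self.out.append(self.acc)
            self.acc, self.accBits = 0, 0

    def getvalue(self):
        if self.accBits:
            return bytes(self.out) + bytes([self.acc << (8 - self.accBits)])
        return bytes(self.out)

The tree-walking Huffman layer can then call read_bit() and write_bit() without caring how the bits map onto bytes.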

Upvotes: 0
