Reputation: 519
Let's say that I've encoded my Huffman tree in with the compressed file. So I have as an example file output:
001A1C01E01B1D
I'm having an issue saving this string to file bit-by-bit. I know that C++ can only output to file one byte at a time, so I'm having an issue storing this string in bytes. Is it possible to convert the first three bits to a char without the program padding to a byte? If it pads to a byte for the traversal codes then my tree (And the codes) will be completely messed up. If I were to chop this up one byte at a time, then what happens if the tree isn't exactly a multiple of 8? What happens if the compressed file's bit-length isn't exactly a multiple of 8?
Hopefully I've been clear enough.
Upvotes: 1
Views: 529
Reputation: 2327
The standard solution to this problem is padding. There are many possible padding schemes. Padding schemes pad up to an even number of bytes (i.e., a multiple of 8 bits). Additionally, they encode either the length of the message in bits, or the number of padding bits (from which the message length in bits can be determined by subtraction). The latter solution obviously results in slightly more efficient paddings.
Most simply, you can append the number of "unused" bits in the last byte as an additional byte value.
One level up, start by assuming that the number of padding bits fits in 3 bits. Define the last 3 bits of an encoded file to encode the number of padding bits. Now if the message uses up no more than 5 bits of the last byte, the padding can fit nicely in the same byte. If it is necessary to add a byte to contain the padding, the maximum gap is 5+2=7 (5 from the unused high bits of the extra byte, and 2 is the maximum possible space free in the last byte, otherwise the 3-bit padding value would've fit there). Since 0-7 is representable in 3 bits, this works (it doesn't work for 2 bits, since the maximum gap is larger and the range of representable values is smaller).
By the way, one of the main advantages of placing the padding information at the end of the file (rather than as a header at the beginning of the file) is that the compression functions can then operate on a stream without having to know its length in advance. Decompression can be stream-based as well, with careful handling of EOF signals.
Upvotes: 1
Reputation: 112374
Simply treat a sequence of n bytes as a sequence of 8n bits. Use the >>
or <<
, |
, and &
operators to assemble bytes from the sequence of variable-length bit codes.
The end of the stream is important to handle properly. You need an end of stream code so that the decoder knows to stop and not try to decode the final padding bits that complete the last byte.
Upvotes: 1