Reputation: 21
This is a two-part question from a newbie.
First, I need an encoding for simple text (without the lowercase/caps distinction), and I need it to be more space-efficient than ASCII. So I have thought of creating my own 5-bit code, holding a range of 32 characters (the alphabet plus some punctuation marks). As far as I understand, all modern computing 'thinks' in bytes, so I cannot actually define my own 5-bit encoding without actually resorting to a 8-bit encoding.
What I am thinking of doing is: I define my own 5-bit code and I save the text in 3-character blocks, each block saved as 2 bytes. Every block will occupy total of 15 bits, which will be stored within two bytes (holding 16 bits). I might use the extra bit for parity check, even if I don't actually need it. Does this approach make sense? Or is there any better approach? Alternatively, I might define a 6-bit encoding, and save the text into blocks of 4 characters each, with each block being saved in 3 bytes.
The second part of the question is: assuming that the text will then be compressed (via a standard lossless algorithm for texts, such as zip for example), is it worth the trouble of creating my own encoding (as explained above)? Or will the compression algorithm take care of the space inefficiency of the 8-bit encoding, making the compressed as efficient as a compressed file which originally was encoded using a 5-bit or 6-bit encoding? If so, I would have no advantage of using a 5/6 bit encoding for pre-compression text, so I would simply skip altogether this step. I need to know from experienced programmers, what is the case?
Thanks everyone
Upvotes: 1
Views: 394
Reputation: 112284
The compression algorithm will handle the coding for you more efficiently. It will use Huffman, Range, or Arithmetic coding to use a variable number of bits, even fractional bits, per letter, taking advantage of the statistics of your actual data. This will work much better if you don't try to precode the characters, stuffing them into less than 8-bits each. The compression algorithms count the statistics by the symbols found in each byte, and look for repeating patterns in the bytes.
Upvotes: 1
Reputation: 4988
You do not need to worry about 'blocks'. Just append the 5 bits to a 8 bit buffer, when this buffer is filled flush it out and push any remaining bits into the buffer.
The only ambiguity comes at the end of the message when you may have a partially filled buffer with number of yet to be filled bits >= 5. As such:
a. You must specify the length of your message (n*5 bits) or
b. You must specify only the length of the trailing bits (more efficient)
Compression algorithms might actually take an adverse hit by your custom packing - (depending on the type of the original data like text).
Upvotes: 1