Hal
Hal

Reputation: 968

What's the most compact way of encoding lists of symbols from a finite set?

I'm interested in representing a sequence of symbols from a finite set in the least number of bytes.

For example, say you had a text string which only contained the characters a-z. You could encode them as ascii, so 1 byte per symbol (character). However, by doing that you're only using 26 of the possible 256 values per byte.

I've coded a solution which seems to work well, but I'd like to know if anyone knows or can think of a better way.

My method is to treat the sequence as an integer in base n, where n is the size of the set of symbols + 1. For example, if your set or symbols, or "alphabet" was {a, b, c} (length 3) then we'd use base 4. The symbols are assigned numerical values so {a => 1, b => 2, c => 3}. Therefore, the sequence [b, a, c] is treated as the number 213 in base 4, so 39 in decimal. This integer can be encoded in binary, and decoded back to its base 4 representation to retrieve the sequence 2, 1, 3 => [b, a, c].

My Python implementation of the above: radixcodec.py

So my question is, is there a more space efficient method of encoding lists of elements from a finite set than the one I've described?

Upvotes: 2

Views: 1091

Answers (1)

ᅠᅠᅠ
ᅠᅠᅠ

Reputation: 67000

Use base n where n is the number of the symbols (e.g. {a => 0, b => 1, c => 2}). That method is optimal if each symbol is equally likely to appear. (You'll also have to store the length of the string, of course. By the way, your implementation uses Python strings; those are definitely not the most space-efficient data structure you can find.)

If the frequencies of the symbols vary, and you know them, you can use Huffman coding. If you don't know the frequencies, there's adaptive Huffman coding.

Anyway, the best method will depend on the application.

Upvotes: 4

Related Questions