U2EF1

Reputation: 13279

Optimal integer encoding that still sorts

One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering if there was a similar encoding that was optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx if they are not the first byte representing a codepoint).
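
For illustration, here is a quick check of that property in Python (default string comparison in Python is codepoint-by-codepoint, so the two orderings can be compared directly):

    # Byte-wise comparison of UTF-8 data agrees with codepoint-wise comparison
    words = ["a", "z", "~", "é", "😀"]
    assert sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))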

The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.

I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.

Upvotes: 10

Views: 636

Answers (4)

Wolfgang Brehm

Reputation: 1741

Levenshtein coding is asymptotically optimal and preserves the ordering. It also has short codes for zero (0) and one (10). It is not as compact for numbers in between (such as 256 or 65536), but it is not that hard to implement (a sketch of an encoder is given after the table).

     0 0
     1 10
     2 1100
     3 1101
     4 1110000
     5 1110001
     6 1110010
     7 1110011
     8 11101000
     9 11101001
    10 11101010
    11 11101011
    12 11101100
    13 11101101
    14 11101110
    15 11101111
    16 111100000000
    17 111100000001
    18 111100000010
    19 111100000011
    20 111100000100
    21 111100000101
    22 111100000110
    23 111100000111
    24 111100001000
    25 111100001001
    26 111100001010
    27 111100001011
    28 111100001100
    29 111100001101
    30 111100001110
    31 111100001111
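
For reference, here is a minimal Python sketch of an encoder producing the codes above as bit strings; the function name is just illustrative:

    def levenshtein_encode(n):
        """Levenshtein code of a non-negative integer, as a bit string."""
        if n == 0:
            return "0"
        pieces = []
        ones = 1
        while n > 1:
            bits = bin(n)[3:]      # binary representation of n without its leading 1
            pieces.append(bits)
            n = len(bits)          # recurse on the number of bits just written
            ones += 1
        return "1" * ones + "0" + "".join(reversed(pieces))

    # The codes sort bitwise in the same order as the numbers they encode
    codes = [levenshtein_encode(i) for i in range(32)]
    assert codes == sorted(codes)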

Upvotes: 0

Erick Wong

Reputation: 165

Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.

Looks like this problem has been well-studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker; it is described here and in more detail in this thesis.

This paper discussing order-preserving dictionary-based compression also looks interesting.
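
To make the greedy idea concrete, here is a rough Python sketch that repeatedly merges the adjacent pair of nodes with the smallest combined weight and reads the codewords off the resulting tree; as noted above, this greedy rule is not optimal (Hu-Tucker is), it merely preserves the symbol order:

    def greedy_alphabetic_code(weights):
        """Order-preserving prefix code via greedy adjacent merges (not optimal)."""
        # nodes are (weight, tree); a tree is a symbol index or a (left, right) pair
        nodes = [(w, i) for i, w in enumerate(weights)]
        while len(nodes) > 1:
            # merge the adjacent pair with the smallest combined weight
            k = min(range(len(nodes) - 1),
                    key=lambda i: nodes[i][0] + nodes[i + 1][0])
            (wl, left), (wr, right) = nodes[k], nodes[k + 1]
            nodes[k:k + 2] = [(wl + wr, (left, right))]
        codes = {}
        def walk(tree, prefix):            # leaves stay in order, so codes stay ordered
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix
        walk(nodes[0][1], "")
        return [codes[i] for i in range(len(weights))]

    # Example: a distribution where smaller numbers are more frequent
    codes = greedy_alphabetic_code([1.0 / (n + 1) for n in range(8)])
    assert codes == sorted(codes)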

Upvotes: 3

Pavel Radzivilovsky

Reputation: 19114

There are very few standard encodings, and the answer is no. Any further optimization beyond UTF-8 should not be referred to as an "encoding" but as a "compression" - and lexicographically-comparable compression is a different department.

If you are solving a real-world (not purely academic) problem, I'd just stick with the most standard UTF-8. You can read about its efficiency compared to other standard encodings at utf8everywhere.org.

Upvotes: 1

Klas Lindbäck

Reputation: 33283

To fully answer that question you need to know the frequency of the codepoints in the material. UTF-8 is optimal for texts in English as multi-byte characters are very rare in typical English text.

Encoding integers using UTF-8 as a base algorithm would entail mapping the first n integers to a 1-byte encoding, the next m to a 2-byte encoding, and so on. Whether that is an optimal encoding depends on the distribution: if the first n numbers are very frequent compared to higher numbers, then such a UTF-8-style scheme would be (near) optimal.
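
For concreteness, here is a rough Python sketch of such a mapping - a hypothetical UTF-8-style scheme (not a standard format) where the first byte carries the length in unary in its high bits and each length class covers its own disjoint, increasing range of values, so byte-wise comparison of the outputs matches numeric comparison:

    def encode_ordered(n):
        """Hypothetical order-preserving varint for non-negative integers."""
        if n < 0:
            raise ValueError("non-negative integers only")
        offset = 0
        for extra in range(8):                      # number of continuation bytes
            payload_bits = (7 - extra) + 8 * extra  # payload bits at this length
            size = 1 << payload_bits
            if n < offset + size:
                value = n - offset
                marker = (0xFF << (8 - extra)) & 0xFF   # 'extra' leading 1s, then a 0
                first = marker | (value >> (8 * extra))
                rest = [(value >> (8 * i)) & 0xFF for i in reversed(range(extra))]
                return bytes([first] + rest)
            offset += size
        raise OverflowError("value too large for this sketch")

    # Byte-wise order of the encodings matches numeric order
    samples = [0, 1, 127, 128, 300, 16511, 16512, 10**6]
    assert [encode_ordered(i) for i in samples] == sorted(encode_ordered(i) for i in samples)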

Upvotes: 0
