Reputation: 13279
One of the neat characteristics of UTF-8 is that if you compare two strings (with <) byte-by-byte, you get the same answer as if you had compared them codepoint-by-codepoint. I was wondering whether there is a similar encoding that is optimal in size (e.g. UTF-8 "wastes" space by tagging bytes with 10xxxxxx when they are not the first byte representing a codepoint).
The assumption for optimality here is that a non-negative number n is more frequent than a number m if n < m.
I am most interested in knowing if there is a (byte-comparable) encoding that works for integers, with n more frequent than m if |n| < |m|.
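For concreteness, here is a small Python check of the byte-comparison property I mean (the sample strings are just an illustration):

    # Comparing the UTF-8 bytes of two strings gives the same order as
    # comparing their code point sequences.
    pairs = [("abc", "abd"), ("a", "\u00e1"), ("\u00e9", "\u732b")]
    for s, t in pairs:
        byte_order = s.encode("utf-8") < t.encode("utf-8")
        codepoint_order = [ord(c) for c in s] < [ord(c) for c in t]
        assert byte_order == codepoint_order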
Upvotes: 10
Views: 636
Reputation: 1741
Levenshtein coding is asymptotically optimal and preserves the ordering. It also has short codes for zero (0) and one (10). It's not that great for numbers in between (like 256 and 65536), but it's also not that hard to implement; a small encoder sketch follows the table below.
0 0
1 10
2 1100
3 1101
4 1110000
5 1110001
6 1110010
7 1110011
8 11101000
9 11101001
10 11101010
11 11101011
12 11101100
13 11101101
14 11101110
15 11101111
16 111100000000
17 111100000001
18 111100000010
19 111100000011
20 111100000100
21 111100000101
22 111100000110
23 111100000111
24 111100001000
25 111100001001
26 111100001010
27 111100001011
28 111100001100
29 111100001101
30 111100001110
31 111100001111
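A minimal Python sketch of the encoder (bit strings only; packing the bits into bytes is left out):

    def levenshtein_code(n: int) -> str:
        """Return the Levenshtein code of a non-negative integer as a bit string."""
        if n == 0:
            return "0"
        pieces = []
        count = 0
        while n > 0:
            bits = bin(n)[2:]        # binary representation of n
            pieces.append(bits[1:])  # keep everything after the leading 1 bit
            n = len(bits) - 1        # next, encode the number of bits just kept
            count += 1
        return "1" * count + "0" + "".join(reversed(pieces))

    assert levenshtein_code(8) == "11101000"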
Upvotes: 0
Reputation: 165
Have you considered a variant of Huffman coding? Traditionally one recursively merges the two least frequent symbols, but to preserve order one could instead merge the two adjacent symbols having the least sum.
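A naive Python sketch of that greedy adjacent-merge, for illustration only (as noted below, it is not optimal):

    def greedy_alphabetic_code(weights):
        """Order-preserving prefix code: repeatedly merge the adjacent pair
        of subtrees with the smallest combined weight (greedy, not optimal)."""
        # Each entry is (weight, tree); a tree is a symbol index or a (left, right) pair.
        nodes = [(w, i) for i, w in enumerate(weights)]
        while len(nodes) > 1:
            k = min(range(len(nodes) - 1), key=lambda i: nodes[i][0] + nodes[i + 1][0])
            (w1, t1), (w2, t2) = nodes[k], nodes[k + 1]
            nodes[k:k + 2] = [(w1 + w2, (t1, t2))]
        codes = {}
        def assign(tree, prefix):
            if isinstance(tree, tuple):    # internal node: 0 = left, 1 = right
                assign(tree[0], prefix + "0")
                assign(tree[1], prefix + "1")
            else:                          # leaf: symbol index
                codes[tree] = prefix or "0"
        assign(nodes[0][1], "")
        return codes

    # Decreasing weights give the small symbols short codes, and the codewords
    # stay in lexicographic order: {0: '0', 1: '10', 2: '110', 3: '1110', 4: '1111'}
    print(greedy_alphabetic_code([8, 4, 2, 1, 1]))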
Looks like this problem has been well studied (and the greedy algorithm is not optimal). The optimal algorithm was given by Hu and Tucker; it is described here and in more detail in this thesis.
This paper discussing order-preserving dictionary-based compression also looks interesting.
Upvotes: 3
Reputation: 19114
There are very few standard encodings, and the answer is no. Any further optimization beyond UTF-8 should not be referred to as an "encoding" but as "compression", and lexicographically-comparable compression is a different field.
If you are solving a real-world (not purely academic) problem, I'd just stick with standard UTF-8. You can learn about its efficiency compared to other standard encodings at utf8everywhere.org.
Upvotes: 1
Reputation: 33283
To fully answer that question, you need to know the frequency of the codepoints in the material. UTF-8 is optimal for English text, as multi-byte characters are very rare in typical English text.
Encoding integers with UTF-8 as the base algorithm would entail mapping the first n integers to a 1-byte encoding, the next m to a 2-byte encoding, and so on. Whether that is optimal depends on the distribution: if the first n numbers are very frequent compared to higher numbers, then UTF-8 would be (near) optimal.
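As an illustration, here is one such byte-comparable scheme in Python. The thresholds (2**7, 2**14, ...) are my own choice, mirroring UTF-8's length-in-the-first-byte idea; they are not prescribed by anything above.

    def encode_ordered(n: int) -> bytes:
        """UTF-8-style, order-preserving encoding of a non-negative integer:
        1 byte for n < 2**7, 2 bytes for n < 2**14, 3 bytes for n < 2**21, ...
        Comparing the resulting byte strings matches comparing the integers."""
        if n < 0 or n >= 1 << 56:
            raise ValueError("out of range for this sketch")
        length = 1
        while n >= 1 << (7 * length):
            length += 1
        # The first byte starts with (length - 1) one bits and a zero bit,
        # so longer (larger) values always compare greater on the first byte.
        prefix = (0xFF << (9 - length)) & 0xFF
        payload = n.to_bytes(length, "big")
        return bytes([prefix | payload[0]]) + payload[1:]

    assert sorted([70000, 5, 300], key=encode_ordered) == [5, 300, 70000]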
Upvotes: 0