Reputation: 477513
Say you have a huge set of strings:
{'Ape','Bed','Car','Desk','Edge',...}
and you have to write the set to a file. The order in which the strings are written is arbitrary (you only need to store and retrieve the set).
Since the set can be huge, you want to compress it, for instance with gzip. I am interested in ways to order the set such that the compression rate is optimized. Sorting alphabetically looks like a promising candidate, since it means that when gzip compresses a "frame" it sees many strings that start with the same substring: near the beginning of the file a prefix like AAA could be encoded very compactly, and near the end the same would hold for a prefix like ZZY.
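The effect is easy to reproduce; here is a minimal Python sketch of such a comparison (the dictionary path is just an example, any newline-separated UTF-8 word list will do):
import gzip, random

# Read the word list; the path is only an example.
with open("/usr/share/dict/american-english", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

def gzip_size(strings):
    # Size of the gzip-compressed, newline-separated, UTF-8 encoded list.
    return len(gzip.compress("\n".join(strings).encode("utf-8")))

shuffled = list(words)
random.shuffle(shuffled)

print("sorted:  ", gzip_size(sorted(words)))
print("shuffled:", gzip_size(shuffled))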
I already ran a few tests by sorting the strings. For instance, a test set strings (92 274 strings) generates:
-rw-rw-r-- 1 willem willem 296K Jan 19 18:52 strings_norm.gz
-rw-rw-r-- 1 willem willem 419K Jan 19 18:52 strings_shuf.gz
-rw-rw-r-- 1 willem willem 241K Jan 19 19:00 strings_norm.7z
-rw-rw-r-- 1 willem willem 374K Jan 19 19:01 strings_shuf.7z
Here _norm is the sorted list and _shuf is its shuffled counterpart. The strings are originally encoded in UTF-8, one word per line. A more accessible test set is the American English dictionary found on most Linux distros (usually located at /usr/share/dict/american-english; 99 171 words):
-rw-rw-r-- 1 willem willem 208K Jan 19 19:40 american_norm.7z
-rw-rw-r-- 1 willem willem 250K Jan 19 19:39 american_norm.gz
-rw-rw-r-- 1 willem willem 352K Jan 19 19:40 american_shuf.7z
-rw-rw-r-- 1 willem willem 439K Jan 19 19:40 american_shuf.gz
These are generated with:
sort /usr/share/dict/american-english | gzip > american_norm.gz
shuf /usr/share/dict/american-english | gzip > american_shuf.gz
sort /usr/share/dict/american-english | 7z a american_norm.7z -si
shuf /usr/share/dict/american-english | 7z a american_shuf.7z -si
As you can see, sorting gains an additional 29%-43% with gzip and 35%-40% with 7z. A first remark is that both compression algorithms perform better on the sorted input, so the hypothesis is that, although compression rates will differ, there are "good" orderings and "bad" ones: orderings for which nearly all compression algorithms achieve better rates than for other orderings. Of course it is possible that some "shuffled" dataset performs better, but the odds are probably low. I was wondering whether more advanced methods exist. After sorting, consecutive strings will likely start with the same substring, but another ordering might group strings that share a large substring anywhere, like apple, apples, dapple, pineapple, etc.
The problem is probably - like most interesting problems - NP-hard, but I was wondering if there exist heuristics we can use to optimize the compression of a set of strings, given that the order of the strings does not matter, something along the lines of the sketch below.
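To make the kind of heuristic concrete, here is a rough and deliberately naive Python sketch of a greedy nearest-neighbour ordering: always append the remaining string that shares the longest common substring with the previously placed one. The function names are just illustrative, and the whole thing is quadratic in the number of strings, so it only serves to show the idea on small sets:
def lcs_len(a, b):
    # Length of the longest common substring of a and b (simple DP).
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def greedy_order(strings):
    # Start from the alphabetically smallest string and greedily append the
    # remaining string with the largest common substring to the previous one.
    remaining = sorted(strings)
    order = [remaining.pop(0)]
    while remaining:
        nxt = max(remaining, key=lambda s: lcs_len(order[-1], s))
        remaining.remove(nxt)
        order.append(nxt)
    return order

print(greedy_order(['apple', 'apples', 'dapple', 'pineapple', 'bed', 'bedding']))
Whether something like this (or a smarter approximation of it) actually beats plain sorting in practice is exactly what I am wondering about.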
Upvotes: 4
Views: 246