Reputation: 6686
I have a long list of short strings that I want to compress, but I want to be able to decompress an arbitrary string in the list at any time without decompressing the entire list.
I know the list ahead of time and it doesn't matter how much preprocessing is involved. It is also fine if there is some significant O(1) memory overhead.
I realize that I could just compress each string independently with some lossless compression algorithm, but that won't work very well because the strings are very short and each one does not contain much redundancy. Over the entire list, however, there is a lot of redundancy.
Upvotes: 2
Views: 654
Reputation: 112209
I would recommend compressing about 64K worth of strings at a time (about 32 of your strings), requiring that you decompress only 16 strings on average to get the one you want, as opposed to all 1,000,000. You will get almost the same compression with deflate (the compression method used by gzip).
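A minimal sketch of the block approach, in Python rather than zlib's C API; the block size of 32 strings, the 4-byte length prefix, and the helper names are illustrative choices, not anything prescribed by the answer:

```python
import zlib

def build_blocks(strings, block_size=32):
    """Compress strings in fixed-size groups; return the compressed
    blocks plus an index mapping string number -> (block, position)."""
    blocks = []
    index = []
    for start in range(0, len(strings), block_size):
        group = strings[start:start + block_size]
        # Length-prefix each string so the block can be split
        # back apart after decompression.
        payload = b"".join(len(s).to_bytes(4, "big") + s for s in group)
        blocks.append(zlib.compress(payload))
        index.extend((len(blocks) - 1, i) for i in range(len(group)))
    return blocks, index

def get_string(blocks, index, i):
    """Decompress only the block containing string i and pull it out."""
    block_no, pos = index[i]
    payload = zlib.decompress(blocks[block_no])
    offset = 0
    # Skip over the strings that precede string i in this block.
    for _ in range(pos):
        n = int.from_bytes(payload[offset:offset + 4], "big")
        offset += 4 + n
    n = int.from_bytes(payload[offset:offset + 4], "big")
    return payload[offset + 4:offset + 4 + n]
```

A lookup then decompresses one ~64K block instead of the whole list, e.g. `get_string(blocks, index, 57)` inflates only the block holding string 57.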
An alternative, also using deflate, would be to construct a 32K "dictionary", which consists of the most commonly seen sub-strings in your 2,000,000 strings. Then each string can be compressed individually using that 32K from which to draw matches. If your strings have that sort of commonality, then you can get close to the same compression. (See zlib's deflateSetDictionary() and inflateSetDictionary() functions.)
Upvotes: 2