compressed string storage

Lets say I have many objects containing strings of non-trivial length (around ~3-4kb). The strings are all different from each other yet at the same time contain lots of common parts/subsequences. On average maybe 80-90% of any individual string is contained withing the others as well. Is there an easy way to automatically exploit this huge redundancy for compressing the data?
Ideally the solution would be C++ and transparent for the user (i.e. I can use it as if I was accessing a regular read only const std::string but instead reading from compressed storage).

Upvotes: 6

Answers (4)

ruslik

Reputation: 14900

Like @Saeed mentioned, a simple Huffman coding will perform well here.

There is no need in dictionary, if the common words are known apriori (you've mentioned that it's a HTML). Just precompute a huffman table using statistical data from many HTML files (Note that you can encode whole tag by a single symbol, and you can have as many symbols as you want).

Upvotes: 1

SingleNegationElimination

Reputation: 156278

If the common parts of the strings are common because they are composed from other strings, then you might get some traction by using the stlport rope class, which looks for all the world like a std::string, but uses substring tree representation with copy on write that makes them both very space efficient (common substrings are shared) and very good at inserts and deletes (log(n))

When to use rope:

you are making a template engine. document instances are made from a template by substituting varying data in the template, and then cached for future uses. Parts that are common to templates and instances are stored only once and shared across instances, inserts and deletes are cheap.

When not to use rope:

you are loading many documents from outside the domain of your application (from disk, or over a network) and using them without modification. rope doesn't share strings if they are not copied from one rope to another. If you can afford to do the work to find the common substrings, rope can still be used to improve your final representations.

Upvotes: 2

Saeed Amiri

Reputation: 22565

You can use huffman coding implementation is not hard, Also there are zip algorithms in languages (like C# and java) and you can use them.

Also If you sure 80-90% are repeated in all, create a dictionary of all words, then for each string store the position of dictionary word, means have a bit array of big size (10000 i.e) and mark the related position bits[i] to 1 if a words[i] exists in the current string. think each word length is 5 character then the abbreviation takes around 1/5 size.

Upvotes: 3

NPE

Reputation: 500893

Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.

Upvotes: 3

compressed string storage

Answers (4)

Related Questions