Reputation: 5454
Lets say I have many objects containing strings of non-trivial length (around ~3-4kb). The strings are all different from each other yet at the same time contain lots of common parts/subsequences. On average maybe 80-90% of any individual string is contained withing the others as well. Is there an easy way to automatically exploit this huge redundancy for compressing the data?
Ideally the solution would be C++ and transparent for the user (i.e. I can use it as if I was accessing a regular read only const std::string but instead reading from compressed storage).
Upvotes: 6
Views: 1524
Reputation: 14870
Like @Saeed mentioned, a simple Huffman coding will perform well here.
There is no need in dictionary, if the common words are known apriori (you've mentioned that it's a HTML). Just precompute a huffman table using statistical data from many HTML files (Note that you can encode whole tag by a single symbol, and you can have as many symbols as you want).
Upvotes: 1
Reputation: 156138
If the common parts of the strings are common because they are composed from other strings, then you might get some traction by using the stlport rope
class, which looks for all the world like a std::string, but uses substring tree representation with copy on write that makes them both very space efficient (common substrings are shared) and very good at inserts and deletes (log(n))
When to use rope:
When not to use rope:
Upvotes: 2
Reputation: 22555
You can use huffman coding implementation is not hard, Also there are zip algorithms in languages (like C# and java) and you can use them.
Also If you sure 80-90% are repeated in all, create a dictionary of all words, then for each string store the position of dictionary word, means have a bit array of big size (10000 i.e) and mark the related position bits[i]
to 1
if a words[i]
exists in the current string. think each word length is 5 character then the abbreviation takes around 1/5 size.
Upvotes: 3
Reputation: 500167
Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.
Upvotes: 3