Reputation: 5640
I have an application for storing many strings in a TStringList. The strings will be largely similar to one another and it occurs to me that one could compress them on the fly - i.e. store a given string in terms of a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and filenames should be able to be compressed greatly.
Does anyone know of a TStringlist descendant that implement this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that a TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA Ross
Upvotes: 3
Views: 1105
Reputation: 4412
I dont think you really want to compress TStrings items in memory, because it terribly ineffecient. I suggest you to look at TStream implementation in Zlib unit. Just wrap regular stream into TDecompressionStream on load and TCompressionStream on save (you can even emit gzip header there). Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile
Upvotes: 0
Reputation: 4164
I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added, gives a much better performance than having to look up a 'best fit entry' (needed for similarity-compression) over the entire set.
Another thing you might want to read up about, are 'ropes' - a conceptually different type than strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope', representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start out each string with an index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access-speed will suffer greatly. In one simple implementation, I did a mod 16
on the index, to establish the entry at which prefix-reduction started, which gave me on average about 40% memory savings (this number is completely data-dependant of course).
Upvotes: 2
Reputation: 2401
I suppose you expect general flexibility from the list (including delete operation), in this case I don't know about any out of the box solution, but I'd suggest one of the two approaches:
You split your string into words and keep separated growning dictionary to reference the words and save list of indexes internally
You implement something related to zlib stream available in Delphi, but operating by the block that for example can contains 10-100 strings. In this case you still have to recompress/compress the complete block, but the "price" you pay is lower.
Upvotes: 0
Reputation: 186098
You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
Upvotes: 0