Zachary Vance

Reputation: 894

Random-access compression in common use?

Is there a standard library or unix tool used for random-access compression, read-only or read-write? By random-access, I mean that you can read or write any part of the compressed content.

There are a LOT of streaming tools (gzip, xz, etc.) and a few member-based archive tools (zip), but I'm only aware of academic work on random access generally. The main problem with archives is that they typically compress each file individually (no de-duplication across files).

Somewhat related to "Compression formats with good support for random access within archives?", but re-asking because it's 10 years later.

Upvotes: 1

Views: 990

Answers (1)

Mark Adler

Reputation: 112547

For the "read" part of your question, you can build an index to a large compressed file that creates a set of entry points into the compressed data. The number and density of those entry points determine how much data you need to decompress before you get to what you're looking for, and hence how fast the random access is. Note that all compressed data can be accessed randomly, since you always have one entry point at the start. You just decompress until you get what you want. Random read access is then not a question of ability, but a question of speed.

zran.c provides an example of building such an index for a gzip or zlib stream. You could do this for any compressed data format. Also pigz will create a gzip stream with marked entry points using the --independent option. The bzip2 format already has marked entry points for each block, where the entries are a few hundred K bytes of uncompressed data apart.
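
To show how such an index gets used, here is a conceptual sketch only, not zran.c's actual interface: it assumes a hypothetical container in which each chunk is an independent zlib stream, and an index that is a sorted list of (uncompressed offset, compressed offset, compressed length) entries:

    import bisect
    import zlib

    # Conceptual sketch only -- not zran.c's interface.  `index` is assumed
    # to be a sorted list of (uncompressed_offset, compressed_offset,
    # compressed_length) entries, one per independently decompressible zlib
    # chunk in the open file `f`.
    def read_at(f, index, offset, length):
        # Find the last entry point at or before the requested offset.
        i = bisect.bisect_right([e[0] for e in index], offset) - 1
        uoff, coff, clen = index[i]

        # Decompress only from that entry point onward, then slice out the
        # requested range (a read spanning chunks would just repeat this).
        f.seek(coff)
        chunk = zlib.decompress(f.read(clen))
        return chunk[offset - uoff : offset - uoff + length]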

As for "or write", that's a whole different question. I am not aware of a format that facilitates random writing to compressed data. Generally, to get any level of decent compression, compressed data depends on all the data that precedes it. So if you write in the middle, you would need to decompress and recompress all of the data that follows.

To have true random write access, the format would have to solve two problems. The first would be to chunk the compressed data, like bzip2 or pigz's independent options do, to break the dependency on previous data. If you don't do that too often, there will only be a small impact to the compression ratio. Then you could take a chunk, decompress it, do your random write, and then recompress it. The smaller the chunks, the faster this random write access would be, but you have to trade that off with the compression ratio impact of smaller chunks.
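
Sticking with the same hypothetical container of independent zlib chunks, the per-chunk rewrite might be sketched like this; note that the recompressed chunk will generally come out a different size, which is exactly the second problem below:

    import zlib

    # Sketch of a chunk-level random write, assuming independent zlib chunks.
    # Only the chunk containing the target offset is decompressed, patched,
    # and recompressed; nothing before or after it has to be decompressed.
    def rewrite_in_chunk(compressed_chunk, rel_offset, new_bytes):
        data = bytearray(zlib.decompress(compressed_chunk))
        data[rel_offset : rel_offset + len(new_bytes)] = new_bytes
        return zlib.compress(bytes(data), 9)   # usually a different size now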

The second problem to solve is the same one that file systems have, which is permitting fragmentation and non-sequential order. You want to avoid having to rewrite all of the subsequent chunks in the compressed data, due to the new randomly written chunk being smaller or larger. If you did rewrite, you're back to the random write taking a time proportional to the size of the compressed stream. The solution would be writing your own little file system built into the compressed data format that permits the chunks to not be in order in the stream, and for there to be unused gaps in the stream. I know of no format that does this, but it is possible in principle to implement.
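
A sketch of that little file system, again for the hypothetical chunked container: a table maps each logical chunk to wherever its current compressed bytes happen to live, so a rewritten chunk that no longer fits in place is appended at the end and the old bytes become an unused gap, instead of forcing a rewrite of every later chunk:

    # Hypothetical in-file chunk table: logical chunk number -> (offset,
    # length) of that chunk's current compressed bytes.  A real format would
    # store this table in the file itself.
    chunk_table = {}

    def replace_chunk(f, chunk_no, new_compressed):
        off, length = chunk_table[chunk_no]
        if len(new_compressed) > length:
            f.seek(0, 2)                 # no longer fits: append at the end,
            off = f.tell()               # leaving the old bytes as a gap
        else:
            f.seek(off)                  # still fits: overwrite in place
        f.write(new_compressed)
        chunk_table[chunk_no] = (off, len(new_compressed))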

If you limit your write access to appending, as would be the case for example for compressed logs, there are solutions for existing formats. gzlog is an example for appending to a gzip stream, allowing small amounts of data to be efficiently appended, leaving a valid gzip stream after each append.
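
gzlog goes further to make very small appends cheap, but the basic idea can be sketched with nothing more than the fact that concatenated gzip members are themselves a valid gzip stream (the log file name is a placeholder):

    import gzip

    # Not gzlog -- just an illustration that append-only access is easy:
    # each append writes one more complete gzip member, and the concatenation
    # of gzip members is still a valid gzip stream that gzip -dc decompresses
    # in order.
    def append_log(path, text):
        with open(path, "ab") as f:
            f.write(gzip.compress(text.encode()))

    append_log("app.log.gz", "one more log line\n")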

Upvotes: 1
