Reputation: 19104

Simple API for random access into a compressed data file

Please recommend a technology suitable for the following task.

I have a rather big (500MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should be well-compressible) and the storage is expensive where it sits.

What I am looking for, is to compress it with a good compression algorithm (Like, say, GZip) with markers that would enable very occasional random access. Random access as in "read byte from location [64bit address] in the original (uncompressed) stream". This is a little different than the classic deflator libraries like ZLIB, which would let you decompress the stream continuously. What I would like, is have the random access at latency of, say, as much as 1MB of decompression work per byte read.

Of course, I hope to use existing library rather than reinvent the NIH wheel.

Upvotes: 6

Answers (8)

hoxnox

Reputation: 355

Take a look at my project - csio. I think it is exactly what you are looking for: stdio-like interface and multithreaded compressor included.

It is library, writen in C, which provides CFILE structure and functions cfopen, cfseek, cftello, and others. You can use it with regular (not compressed) files and with files, compressed with help of dzip utility. This utility included in the project and written in C++. It produces valid gzip archive, wich can be handled by standard utilities as well as with csio. dzip can compress in many threads (see -j option), so it can very fast compress very big files.

Tipical usage:

dzip -j4 myfile

...

CFILE file = cfopen("myfile.dz", "r");
off_t some_offset = 673820;
cfseek(file, some_offset);
char buf[100];
cfread(buf, 100, 1, file);
cfclose(file);

It is MIT licensed, so you can use it in your projects without restrictions. For more information visit project page on github: https://github.com/hoxnox/csio

Upvotes: 1

hippietrail

Reputation: 17023

You could use bzip2 and make your own API pretty easily based on the James Taylor's seek-bzip2

Upvotes: 0

jkff

Reputation: 17913

If you're working in Java, I just published a library for that: http://code.google.com/p/jzran.

Upvotes: 2

Jerry Coffin

Reputation: 490728

If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).

Edit: Yes, there is existing code to handle this. Recent versions of Info-zip's unzip (6.0 is current) include api.c. Among other things, that includes UzpUnzipToMemory -- you pass it the name of a ZIP file, and the name of one of the file in that archive that you want to retrieve. You then get a buffer holding the contents of that file. For updating, you'll need the api.c from zip3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).

Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).

Upvotes: 1

Marcus Adams

Reputation: 53880

Byte Pair Encoding allows random access to data.

You won't get as good compression with it, but you're sacrificing adaptive (variable) hash trees for a single tree, so you can access it.

However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with 1 MB of latency, you'll be creating an index for every 1 MB. Hopefully you can figure out a way to make your index small enough to still benefit from the compression.

One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.

If it's accessed rarely, you could compress the index with gzip and decode it when needed.

Upvotes: 1

Gustavo V

Reputation: 152

Sort the big file first
divide it in chunks of your desire size (1MB) with some sequence in the name (File_01, File_02, .., File_NN)
take first ID from each chunk plus the filename and put both data into another file
compress the chunks
you will able to made a search into the ID's file using the method that you wish, may be a binary search and open each file as you need.

If you need a deep Indexing you could use a BTree algorithm with the "pages" are the files. on the web exists several implementation of this because are little tricky the code.

Upvotes: 0

Elpezmuerto

Reputation: 5591

I would recommend using the Boost Iostreams Library. Boost.Iostreams can be used to create streams to access TCP connections or as a framework for cryptography and data compression. The library includes components for accessing memory-mapped files, for file access using operating system file descriptors, for code conversion, for text filtering with regular expressions, for line-ending conversion and for compression and decompression in the zlib, gzip and bzip2 formats.

The Boost library been accepted by the C++ standards committee as part of TR2 so it will eventually be built-in to most compilers (under std::tr2::sys). It is also cross-platform compatible.

Boost Releases

Boost Getting Started Guide NOTE: Only some parts of boost::iostreams are header-only library which require no separately-compiled library binaries or special treatment when linking.

Upvotes: 0

Edward Strange

Reputation: 40895

Compression algorithms usually work in blocks I think so you might be able to come up with something based on block size.

Upvotes: 0

Simple API for random access into a compressed data file

Answers (8)

Related Questions