Migwell

Reputation: 20137

Does Apache Arrow support separately-compressed chunks?

In bioinformatics we have bgzip, a block-compressed format: you can compress a file (say, a CSV), and if you later want to access data in the middle of that file, you can decompress only the relevant block rather than the entire file.

As is explained here, Arrow (and therefore Feather v2, the file format) seems to support chunked reads and writes, as well as compression. However, it isn't clear whether the compression applies to the entire file or whether individual chunks can be decompressed on their own. So my question is: can we separately compress chunks of an Arrow/Feather v2 file and then later decompress a single chunk without decompressing everything?
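To illustrate the kind of access pattern I mean, here is a minimal sketch of block compression with random access, using plain zlib rather than actual BGZF (the chunk contents and sizes are arbitrary): each chunk is compressed independently, and an offset index lets you decompress one chunk without touching the rest.

```python
import zlib

# Compress each chunk separately and record (offset, length) per block.
chunks = [f"row-{i}\n".encode() * 100 for i in range(5)]
blob, index = b"", []
for chunk in chunks:
    comp = zlib.compress(chunk)
    index.append((len(blob), len(comp)))
    blob += comp

# Random access: decompress only block 3, ignoring the other blocks.
off, length = index[3]
data = zlib.decompress(blob[off:off + length])
print(data[:6])  # the start of block 3 only
```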

Upvotes: 1

Views: 407

Answers (1)

li.davidm

Reputation: 12126

The compression is applied to the individual buffers within each RecordBatch, so yes, you still get random access to each of the record batches in the file. This isn't documented in the user docs, but it is present in the format specification, where compression is specified per RecordBatch.

Upvotes: 2
