Migwell

Reputation: 20137

Does Apache Arrow support separately-compressed chunks?

In bioinformatics we have bgzip, a block-compressed format: you can compress a file (say, a CSV), and if you later want to access data in the middle of that file, you can decompress only the relevant block rather than the entire file.

As is explained here, Arrow (and therefore Feather v2, the file format) seems to support chunked reads and writes, as well as compression. However, it isn't clear whether the compression applies to the entire file or whether individual chunks can be decompressed on their own. So my question is: can we separately compress chunks of an Arrow/Feather v2 file and then later decompress a single chunk without decompressing everything?
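To illustrate the kind of access pattern I mean, here is a minimal sketch of block compression with random access, using plain zlib rather than actual BGZF (the chunk contents and sizes are arbitrary): each chunk is compressed independently, and an offset index lets you decompress one chunk without touching the rest.

```python
import zlib

# Compress each chunk separately and record (offset, length) per block.
chunks = [f"row-{i}\n".encode() * 100 for i in range(5)]
blob, index = b"", []
for chunk in chunks:
    comp = zlib.compress(chunk)
    index.append((len(blob), len(comp)))
    blob += comp

# Random access: decompress only block 3, ignoring the other blocks.
off, length = index[3]
data = zlib.decompress(blob[off:off + length])
print(data[:6])  # the start of block 3 only
```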

Upvotes: 1

Views: 407

Answers (1)

li.davidm

Reputation: 12126

The compression is applied to the individual buffers within each RecordBatch, so yes, you still get random access to each of the record batches in the file. This isn't documented in the user docs, but it is present in the format specification, where compression is specified per RecordBatch.

Upvotes: 2
