Read AWS S3 GZIP Object using GetObjectRequest with range

Question

I am trying to read a large gzip-compressed (.gz) object from AWS S3. I don't want to read the whole object; I want to read it in parts so that I can process the uncompressed data in parallel. I am reading it with GetObjectRequest and the "Range" header, setting a byte range. However, when I request a range in the middle of the object, such as (100, 200), it fails with "Not in GZIP format". The reason is that the request returns a stream that starts at byte 100, and when I wrap it in GZIPInputStream it fails, because GZIPInputStream expects the stream to begin with the gzip magic number (GZIP_MAGIC = 0x8b1f) to confirm that the data is gzip, and those bytes are not present in the ranged stream.

   GetObjectRequest rangeObjectRequest = new GetObjectRequest(<>, <>).withRange(100, 200);
   S3Object object = s3Client.getObject(rangeObjectRequest);
   S3ObjectInputStream rawData = object.getObjectContent();
   // Fails here with "Not in GZIP format"
   InputStream data = new GZIPInputStream(rawData);

Can anyone guide the right approach?

Parsifal · Accepted Answer

GZIP is a compression format in which decoding any byte of the file depends on the bytes that precede it. This means that you can't pick an arbitrary byte range out of the file and make sense of it, even if you get past the header check.
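To see the failure without touching S3, here is a minimal in-memory sketch. The class and method names (`RangeSliceDemo`, `gzip`, `readsAsGzip`) are made up for illustration, and `Arrays.copyOfRange` stands in for the ranged GET. It assumes Java 9+ for `readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the AWS SDK.
final class RangeSliceDemo {

    /** Compresses raw into a single in-memory gzip stream. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory streams
        }
        return out.toByteArray();
    }

    /** Returns true only if GZIPInputStream accepts the bytes and inflates them fully. */
    static boolean readsAsGzip(byte[] bytes) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            gz.readAllBytes();
            return true;
        } catch (IOException e) { // includes ZipException: Not in GZIP format
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = new byte[4096];
        new Random(42).nextBytes(raw);
        byte[] whole = gzip(raw);

        // Stand-in for withRange(100, 200): bytes 100..200 inclusive.
        byte[] slice = Arrays.copyOfRange(whole, 100, 201);

        System.out.println("whole object readable: " + readsAsGzip(whole));
        System.out.println("bytes 100-200 readable: " + readsAsGzip(slice));
    }
}
```

The whole stream decodes, while the slice should be rejected as soon as the constructor reads the header.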

If you need to read arbitrary byte ranges, you'll need to store the object uncompressed.

You could also create your own file storage format that stores chunks of the file as separately compressed blocks. You could do this using the ZIP format, where each entry in the archive holds one fixed-size block of the original data. But you'd need to implement your own ZIP directory reader to make that work, so that you know which byte range to request for each block.
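As a rough sketch of the separately compressed blocks idea, the example below swaps the ZIP container for something simpler: each block is written as its own gzip member and the start offsets are kept in a plain array. This is not the ZIP approach itself, only an illustration of the same principle. `BlockGzip`, `pack`, and `readBlock` are hypothetical names, `Arrays.copyOfRange` stands in for the ranged GET, and Java 9+ is assumed for `readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: data stored as independently compressed gzip blocks.
final class BlockGzip {

    final byte[] bytes;   // concatenated gzip members (what would be uploaded)
    final long[] offsets; // offsets[i] = first byte of block i; last entry = total length

    private BlockGzip(byte[] bytes, long[] offsets) {
        this.bytes = bytes;
        this.offsets = offsets;
    }

    /** Splits raw into blockSize-byte chunks and gzips each chunk on its own. */
    static BlockGzip pack(byte[] raw, int blockSize) {
        int blocks = (raw.length + blockSize - 1) / blockSize;
        long[] offsets = new long[blocks + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < blocks; i++) {
            offsets[i] = out.size();
            int start = i * blockSize;
            int len = Math.min(blockSize, raw.length - start);
            // Closing gz finishes this member; closing a ByteArrayOutputStream
            // is a no-op, so the next iteration can keep appending to it.
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(raw, start, len);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        offsets[blocks] = out.size();
        return new BlockGzip(out.toByteArray(), offsets);
    }

    /** Simulates a ranged read of block i and decompresses it independently. */
    byte[] readBlock(int i) {
        byte[] slice = Arrays.copyOfRange(bytes, (int) offsets[i], (int) offsets[i + 1]);
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(slice))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "0123456789abcdefghij".getBytes(StandardCharsets.US_ASCII);
        BlockGzip packed = pack(raw, 8);
        // Block 1 covers raw bytes 8..15.
        System.out.println(new String(packed.readBlock(1), StandardCharsets.US_ASCII)); // prints 89abcdef
    }
}
```

Against S3, the equivalent request for block `i` would be `withRange(offsets[i], offsets[i + 1] - 1)`, since the end of the range is inclusive. The offset index has to be stored somewhere the readers can find it, for example in a small companion object; that part is left out here. Smaller blocks give finer-grained parallelism but compress less well, because each block starts with an empty dictionary.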
