Ameer

Reputation: 2638

DeflateStream advancing underlying stream to end

I'm trying to read git objects out of a git pack file, following the pack file format laid out here. Once I hit the compressed data I run into issues. I'm trying to use System.IO.Compression.DeflateStream to decompress the zlib-compressed objects, and I skip the zlib header by ignoring the first 2 bytes. For the first object, at least, these 2 bytes are 78 9C. Now the trouble starts.
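(For what it's worth, 78 9C is the standard two-byte zlib header for the default compression level; this can be checked with, for example, Python's zlib module:)

```python
import zlib

# 0x78 = deflate method with a 32K window; 0x9C = check bits for default compression
blob = zlib.compress(b"anything", 6)
assert blob[:2] == b"\x78\x9c"
```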

1) I only know the size of the decompressed objects. The Read method documentation on DeflateStream states that it "Reads a number of decompressed bytes into the specified byte array." That is what I want; however, I see people setting this count to the size of the compressed data, so one of us is doing it wrong.

2) The data I'm getting back is correct, I think (human-readable data that looks right), but it's advancing the underlying stream I give it all the way to the end! For example, I ask for 187 decompressed bytes and it reads the remaining 212 bytes, all the way to the end of the stream. That is, the whole stream is 228 bytes, and after the DeflateStream read of 187 bytes the stream's position is 228. I can't simply seek backwards: I don't know where the end of the compressed data is, and not all the streams I use are seekable. Is it expected behavior for it to consume the whole stream?

Upvotes: 1

Views: 977

Answers (2)

taka

Reputation: 67

I was doing exactly the same thing as OP (reading git pack files), and managed to hack up a way around this problem.

As per Mark Adler's comment here, DeflateStream is indeed brain-dead and useless, because yes, it does read bytes beyond the compressed data. Looking through the source code here, it reads the input data in 8K blocks :-/

However, DeflateStream instances have a private field _inflater, which has a private field _zlibStream, which has a property AvailIn that returns the number of bytes still available in the input buffer. In other words, this is the number of bytes too many that have been read, so by using reflection to get at these private parts, we can move the file pointer backwards by that many bytes, to return it to where it should've been left, i.e. just past the end of the compressed data.

This code is F#, but it should be clear what's going on:

// zstream is the DeflateStream instance
let inflater = typeof<DeflateStream>.GetField( "_inflater", BindingFlags.NonPublic ||| BindingFlags.Instance ).GetValue( zstream )
let zlibStream = inflater.GetType().GetField( "_zlibStream", BindingFlags.NonPublic ||| BindingFlags.Instance ).GetValue( inflater )
let availInMethod = zlibStream.GetType().GetProperty( "AvailIn" ).GetMethod
let availIn: uint32 = unbox( availInMethod.Invoke( zlibStream, null ) )
// inp is the input file
inp.Seek( -(int64 availIn) + 4L, SeekOrigin.Current ) |> ignore

I think the 4-byte adjustment is because of the Adler-32 checksum at the end of the zlib stream: the deflate decompressor stops before it, so those 4 bytes are still counted in AvailIn, and the +4 skips past them to where the next object starts.
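For comparison, some zlib bindings expose this leftover count directly, with no reflection needed. Python's zlib.decompressobj, for instance, puts everything past the end of the zlib stream (checksum already consumed) into unused_data, which is the same information AvailIn gives you here. A sketch of the idea in Python, not .NET code:

```python
import zlib

payload = b"git object payload" * 10
blob = zlib.compress(payload)   # full zlib stream: header + deflate data + Adler-32
trailer = b"NEXT-OBJECT"        # bytes that follow this object in the pack file

d = zlib.decompressobj()        # expects the 2-byte zlib header
out = d.decompress(blob + trailer)
assert out == payload
# unused_data is everything past the end of the zlib stream
assert d.unused_data == trailer
# bytes actually consumed = total fed in - leftover, i.e. exactly len(blob)
consumed = len(blob + trailer) - len(d.unused_data)
assert consumed == len(blob)
```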

Upvotes: 0

Peter Duniho

Reputation: 70701

According to the page you reference (I'm not familiar with this file format myself), each block of data is indexed by an offset field in the index for the file. Since you know the length of the type and data-length fields that precede each data block, and you know the offset of the next block, you also know the length of each data block (i.e. the length of the compressed bytes).

That is, the length of each data block is simply the offset of the next block minus the offset of the current block, then minus the length of the type and data length fields (however many bytes that is…according to the documentation, it's variable, but you can certainly compute that length as you read it).
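As a worked example (with made-up offsets and a made-up 2-byte header length, just to show the arithmetic):

```python
# hypothetical offsets of two consecutive objects, taken from the pack index
current_offset = 12
next_offset = 240
# length of the variable type + data-length header, measured while reading it
header_len = 2

compressed_len = next_offset - current_offset - header_len
assert compressed_len == 226
```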

So:

1) I only know the size of the decompressed objects. The Read method documentation on DeflateStream states that it "Reads a number of decompressed bytes into the specified byte array." That is what I want; however, I see people setting this count to the size of the compressed data, so one of us is doing it wrong.

The documentation is correct. DeflateStream is a subclass of Stream and has to follow that class's contract: the count you pass to Read() refers to the bytes written into your buffer, and for DeflateStream those are decompressed bytes.
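The same convention holds in other zlib wrappers. For instance, Python's decompressobj takes a max_length that counts decompressed bytes, and the not-yet-processed compressed input is kept for later; this is a Python sketch of the idea, not the .NET API:

```python
import zlib

payload = bytes(range(256)) * 4
blob = zlib.compress(payload)

d = zlib.decompressobj()
out = d.decompress(blob, 100)   # ask for at most 100 *decompressed* bytes
assert len(out) == 100
assert out == payload[:100]
# the remaining compressed input is buffered in unconsumed_tail, not lost
rest = d.decompress(d.unconsumed_tail)
assert out + rest == payload
```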

Note that per the above, you do know the size of the compressed objects. It's not stored in the file, but you can derive that information from the things that are stored in the file.

2) The data I'm getting back is correct, I think (human-readable data that looks right), but it's advancing the underlying stream I give it all the way to the end! For example, I ask for 187 decompressed bytes and it reads the remaining 212 bytes, all the way to the end of the stream. That is, the whole stream is 228 bytes, and after the DeflateStream read of 187 bytes the stream's position is 228. I can't simply seek backwards: I don't know where the end of the compressed data is, and not all the streams I use are seekable. Is it expected behavior for it to consume the whole stream?

Yes, I would expect that to happen. Or at a minimum, I would expect some buffering to happen, so even if it didn't read all the way to the end of the stream, I would expect it to read at least some number of bytes past the end of the compressed data.

It seems to me that you have at least a couple of options:

  1. For each block of data, compute the length of the data (per above), read that into a standalone MemoryStream object, and decompress the data from that stream rather than the original.
  2. Alternatively, go ahead and decompress directly from the source stream, using the offsets provided in the index to seek to each data block as you read it. Of course, this won't work with non-seekable streams, which you indicate occur in your scenario, so it would not cover all of your cases.
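Option 1 can be sketched as follows, with Python standing in for a MemoryStream-based C# version (the compressed length would come from the index-offset arithmetic described above; here it's just assumed known for the demo):

```python
import io
import zlib

payload = b"example git object data"
blob = zlib.compress(payload)

# the pack stream: this object's compressed bytes, then the next object's data
stream = io.BytesIO(blob + b"data for the next object")

# compressed_len would be derived from the index offsets; known here for the demo
compressed_len = len(blob)

chunk = stream.read(compressed_len)     # read exactly this object's bytes
data = zlib.decompress(chunk)           # decompress from the standalone buffer
assert data == payload
assert stream.tell() == compressed_len  # source stream stops right after the object
```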

Upvotes: 0
