Reputation: 2638
I'm trying to read git objects out of a git pack file, following the format for pack files laid out here. Once I hit the compressed data I run into issues. I'm trying to use System.IO.Compression.DeflateStream to decompress the zlib-compressed objects, and I basically ignore the zlib header by skipping over the first 2 bytes (for the first object, at least, these 2 bytes are 78 9C). Now the trouble starts.
1) I only know the size of the decompressed objects. The Read method documentation on DeflateStream states that it "Reads a number of decompressed bytes into the specified byte array." That is what I want, but I do see people setting this count to the size of the compressed data, so one of us is doing it wrong.
2) The data I'm getting back is correct, I think (it's human-readable and looks right), but the read advances the underlying stream I give it all the way to the end! For example, I ask it for 187 decompressed bytes and it reads the remaining 212 bytes, right to the end of the stream: the whole stream is 228 bytes, and after the deflate read of 187 decompressed bytes its position is 228. I can't just seek backwards, since I don't know where the end of the compressed data is, and not all the streams I use will be seekable. Is it expected behavior for it to consume the whole stream?
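For reference, here's a simplified sketch of what I'm doing (illustrative names only, not my actual code; the decompressed size comes from the object's header):
open System.IO
open System.IO.Compression

// Sketch: decompress one object whose decompressed size is already known.
let readObjectData (pack: Stream) (decompressedSize: int) : byte[] =
    // Skip the 2-byte zlib header (78 9C) so DeflateStream sees raw deflate data.
    pack.ReadByte() |> ignore
    pack.ReadByte() |> ignore
    use deflate = new DeflateStream( pack, CompressionMode.Decompress, leaveOpen = true )
    let buffer = Array.zeroCreate<byte> decompressedSize
    // The count passed to Read is in decompressed bytes; a real version would
    // loop, since Read may return fewer bytes than requested.
    deflate.Read( buffer, 0, decompressedSize ) |> ignore
    buffer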
Upvotes: 1
Views: 977
Reputation: 67
I was doing exactly the same thing as OP (reading git pack files), and managed to hack up a way around this problem.
As per Mark Adler's comment here, DeflateStream is indeed a bit brain-dead about this, because yes, it does read bytes beyond the compressed data. Looking through the source code here, it reads the input data in 8K blocks :-/
However, DeflateStream instances have a private member _inflater, which has a private member _zlibStream, which has a property AvailIn that returns the number of bytes still available (i.e. not yet consumed) in the input buffer. In other words, this is the number of bytes too many that have been read, so by using reflection to get at these private parts, we can move the file pointer backwards by that many bytes, returning it to where it should have been left, i.e. just past the end of the compressed data.
This code is F#, but it should be clear what's going on:
open System.IO
open System.IO.Compression
open System.Reflection
// zstream is the DeflateStream instance. The field/property names below are
// internal implementation details and may differ between runtime versions.
let inflater = typeof<DeflateStream>.GetField( "_inflater", BindingFlags.NonPublic ||| BindingFlags.Instance ).GetValue( zstream )
let zlibStream = inflater.GetType().GetField( "_zlibStream", BindingFlags.NonPublic ||| BindingFlags.Instance ).GetValue( inflater )
let availInMethod = zlibStream.GetType().GetProperty( "AvailIn" ).GetMethod
// AvailIn = bytes read from the input but not yet consumed by the inflater.
let availIn: uint32 = unbox( availInMethod.Invoke( zlibStream, null ) )
// inp is the input file: seek back over the over-read bytes, plus 4 to step
// over the zlib trailer (see below).
inp.Seek( -(int64 availIn) + 4L, SeekOrigin.Current ) |> ignore
I think the 4-byte adjustment is needed because a zlib stream ends with a 4-byte Adler-32 checksum after the deflate data; seeking back by AvailIn alone would leave the pointer just before that checksum, and the extra 4 bytes step over it to land just past the end of the whole zlib object.
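For illustration only, here is a rough sketch of how the whole read of one object might be sequenced (readPackedObject and rewindInput are hypothetical names; rewindInput is assumed to wrap exactly the reflection + Seek code above):
open System.IO
open System.IO.Compression

// Sketch: rewindInput runs the reflection + Seek code from the snippet above.
let readPackedObject (inp: Stream) (decompressedSize: int) (rewindInput: DeflateStream -> Stream -> unit) : byte[] =
    inp.ReadByte() |> ignore   // skip the 2-byte zlib header
    inp.ReadByte() |> ignore
    use zstream = new DeflateStream( inp, CompressionMode.Decompress, leaveOpen = true )
    let data = Array.zeroCreate<byte> decompressedSize
    let mutable total = 0
    while total < decompressedSize do
        let n = zstream.Read( data, total, decompressedSize - total )
        if n = 0 then failwith "unexpected end of deflate data"
        total <- total + n
    // Only after all decompressed bytes are out does AvailIn tell us how far
    // past the compressed data the underlying stream has been advanced.
    rewindInput zstream inp
    data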
Upvotes: 0
Reputation: 70701
According to the page you reference (I'm not familiar with this file format myself), each block of data is indexed by an offset field in the index for the file. Since you know the length of the type and data length fields that precede each data block, and you know the offset of the next block, you also know the length of each data block (i.e. the length of the compressed bytes).
That is, the length of each data block is simply the offset of the next block minus the offset of the current block, then minus the length of the type and data length fields (however many bytes that is…according to the documentation, it's variable, but you can certainly compute that length as you read it).
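A hedged sketch of that arithmetic (hypothetical helper names; the variable-length type/size header layout is as described on the linked page, with a continuation bit in the top bit of each byte):
open System.IO

// Parse an entry's type/size header and count how many bytes it occupies.
let readEntryHeader (pack: Stream) =
    let first = pack.ReadByte()
    let objType = (first >>> 4) &&& 0x7        // 3-bit object type
    let mutable size = int64 (first &&& 0xF)   // low 4 bits of the decompressed size
    let mutable shift = 4
    let mutable b = first
    let mutable headerBytes = 1
    while b &&& 0x80 <> 0 do                   // continuation bit set?
        b <- pack.ReadByte()
        headerBytes <- headerBytes + 1
        size <- size ||| (int64 (b &&& 0x7F) <<< shift)
        shift <- shift + 7
    objType, size, headerBytes

// Compressed length = next entry's offset minus this entry's offset (both from
// the pack index), minus the header bytes just counted.
let compressedLength (thisOffset: int64) (nextOffset: int64) (headerBytes: int) : int64 =
    nextOffset - thisOffset - int64 headerBytes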
So:
1) I only know the size of the decompressed objects. The Read method documentation on DeflateStream states that it "Reads a number of decompressed bytes into the specified byte array." That is what I want, but I do see people setting this count to the size of the compressed data, so one of us is doing it wrong.
The documentation is correct. DeflateStream is a subclass of Stream, and has to follow that class's rules. Since the Read() method of Stream outputs up to the number of bytes requested, and what DeflateStream outputs is decompressed data, that count refers to decompressed bytes.
Note that per the above, you do know the size of the compressed objects. It's not stored in the file, but you can derive that information from the things that are stored in the file.
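For example (a sketch only, with hypothetical names): once you've derived the compressed length, you can read exactly that many bytes and decompress them from a MemoryStream, so the original stream is never advanced past the object. Note that the compressed span includes the 2-byte zlib header and the 4-byte Adler-32 trailer:
open System.IO
open System.IO.Compression

// Read exactly compressedLength bytes from the pack, then decompress from a
// MemoryStream so the pack stream is left positioned at the next entry.
let decompressObject (pack: Stream) (compressedLength: int) : byte[] =
    let compressed = Array.zeroCreate<byte> compressedLength
    let mutable read = 0
    while read < compressedLength do
        let n = pack.Read( compressed, read, compressedLength - read )
        if n = 0 then failwith "unexpected end of pack data"
        read <- read + n
    // Skip the 2-byte zlib header; DeflateStream expects raw deflate data.
    use input = new MemoryStream( compressed, 2, compressedLength - 2 )
    use deflate = new DeflateStream( input, CompressionMode.Decompress )
    use output = new MemoryStream()
    deflate.CopyTo( output )
    output.ToArray()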
2) The data I'm getting back is correct, I think (it's human-readable and looks right), but the read advances the underlying stream I give it all the way to the end! For example, I ask it for 187 decompressed bytes and it reads the remaining 212 bytes, right to the end of the stream: the whole stream is 228 bytes, and after the deflate read of 187 decompressed bytes its position is 228. I can't just seek backwards, since I don't know where the end of the compressed data is, and not all the streams I use will be seekable. Is it expected behavior for it to consume the whole stream?
Yes, I would expect that to happen. Or at a minimum, I would expect some buffering to happen, so even if it didn't read all the way to the end of the stream, I would expect it to read at least some number of bytes past the end of the compressed data.
It seems to me that you have at least a couple of options. One is to copy the compressed data (whose length you can compute, per the above) into a MemoryStream object, and decompress the data from that stream rather than the original, as in the sketch above.
Upvotes: 0