Torxed

Reputation: 23480

Content-Encoding: gzip + Transfer-Encoding: chunked with gzip/zlib gives incorrect header check

How do you manage chunked data with gzip encoding? I have a server which sends data in the following manner:

HTTP/1.1 200 OK\r\n
...
Transfer-Encoding: chunked\r\n
Content-Encoding: gzip\r\n
\r\n
1f50\r\n\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec}\xebr\xdb\xb8\xd2\xe0\xef\xb8\xea\xbc\x03\xa2\xcc\x17\xd9\xc7\xba\xfa\x1e\xc9r*\x93\xcbL\xf6\xcc\x9c\xcc7\xf1\x9c\xf9\xb6r\xb2.H ... L\x9aFs\xe7d\xe3\xff\x01\x00\x00\xff\xff\x03\x00H\x9c\xf6\xe93\x00\x01\x00\r\n0\r\n\r\n

I've tried a few different approaches to this, but there's something I'm forgetting here.

data = b''
depleted = False
while not depleted:
    depleted = True
    for fd, event in poller.poll(2.0):
        depleted = False
        if event == select.EPOLLIN:
            tmp = sock.recv(8192)
            data += zlib.decompress(tmp, 15 + 32)

Gives (I also tried decoding only the data after \r\n\r\n, obviously):
zlib.error: Error -3 while decompressing data: incorrect header check

So I figured the data should be decompressed once it has been received in its entirety:

        ...
        if event == select.EPOLLIN:
            data += sock.recv(8192)
data = zlib.decompress(data.split(b'\r\n\r\n',1)[1], 15 + 32)

Same error. Also tried decompressing data[:-7] because of the chunk ID at the very end of the data and with data[2:-7] and other various combinations, but with the same error.

I've also tried the gzip module via:

with gzip.GzipFile(fileobj=BytesIO(data), mode='rb') as fh:
    fh.read()

But that gives me "Not a gzipped file".

Even after recording the data as received from the server (headers + data) into a file, then creating a server socket on port 80 serving that data (again, as-is) to the browser, it renders perfectly, so the data is intact. I took this data, stripped off the headers (and nothing else), and tried gzip on the file.

Thanks to @mark-adler I produced the following code to un-chunk the chunked data:

unchunked = b''
pos = 0
while pos <= len(data):
    chunkLen = int(binascii.hexlify(data[pos:pos+2]), 16)
    unchunked += data[pos+2:pos+2+chunkLen]
    pos += 2+len('\r\n')+chunkLen

with gzip.GzipFile(fileobj=BytesIO(data[:-7])) as fh:
    data = fh.read()

This produces OSError: CRC check failed 0x70a18ee9 != 0x5666e236, which is one step closer. In short, I clip the data according to these four parts:

I'm probably getting there, but not close enough.

Footnote: Yes, the socket handling is far from optimal, but it looks this way because I thought I wasn't getting all the data from the socket, so I implemented a huge timeout and an attempted fail-safe with depleted :)

Upvotes: 1

Views: 4704

Answers (2)

Torxed

Reputation: 23480

@mark-adler gave me some good pointers on how chunked mode in the HTTP protocol works; besides this, I fiddled around with different ways of unzipping the data.

  1. You're supposed to stitch the chunks into one big heap
  2. You're supposed to use gzip, not zlib
  3. You can only unzip the entire stitched data; doing it in parts will not work

Here's the solution for all three of the above problems:

unchunked = b''
pos = 0
while pos <= len(data):
    chunkNumLen = data.find(b'\r\n', pos)-pos
#   print('Chunk length found between:',(pos, pos+chunkNumLen))
    chunkLen=int(data[pos:pos+chunkNumLen], 16)
#   print('This is the chunk length:', chunkLen)
    if chunkLen == 0:
#       print('The length was 0, we have reached the end of all chunks')
        break
    chunk = data[pos+chunkNumLen+len('\r\n'):pos+chunkNumLen+len('\r\n')+chunkLen]
#   print('This is the chunk (Skipping',pos+chunkNumLen+len('\r\n'),', grabing',len(chunk),'bytes):', [data[pos+chunkNumLen+len('\r\n'):pos+chunkNumLen+len('\r\n')+chunkLen]],'...',[data[pos+chunkNumLen+len('\r\n')+chunkLen:pos+chunkNumLen+len('\r\n')+chunkLen+4]])
    unchunked += chunk
    pos += chunkNumLen+len('\r\n')+chunkLen+len('\r\n')

with gzip.GzipFile(fileobj=BytesIO(unchunked)) as fh:
    unzipped = fh.read()

return unzipped

I left the debug output in there, commented out, for a reason.
Even though it looks like a mess, it was extremely useful for seeing what data I was actually trying to decompress, which parts were fetched where, and what value each calculation brings forth.

This code will walk through the chunked data in the following format:
<chunk length of X bytes> \r\n <chunk> \r\n

I had to be careful when extracting the chunk length, since it arrives as ASCII hex such as 1f50. At first I used binascii.hexlify(data[0:4]) before putting it into int(), and that gave me a length of ~8000, but then it suddenly produced a REALLY big number that wasn't logical even though I didn't give it any other data; presumably that's because the field is already hex text, so hexlify re-encodes the characters themselves and int(data[0:4], 16) is all that's needed. After that it was just a matter of making sure the numbers were correct, combining all the chunks into one huge pile of gzip data, and feeding that into GzipFile(...).
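For illustration, here is the difference that caused the confusion above. The chunk-size field is already ASCII hex text, so int(..., 16) parses it directly; hexlify-ing it instead hex-encodes the ASCII characters themselves and yields a nonsensically large number:

```python
import binascii

# The chunk-size field arrives as ASCII hex text, e.g. b'1f50',
# so int(..., 16) parses it directly:
print(int(b'1f50', 16))                    # 8016

# hexlify re-encodes the ASCII characters themselves
# (b'1f50' becomes b'31663530'), giving a huge, wrong value:
print(int(binascii.hexlify(b'1f50'), 16))  # 828781872
```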

Edit 3 years later:

I'm aware that this was a client-side problem at first, but here's a server-side function to send a somewhat functional test:

def http_gzip(data):
    compressed = gzip.compress(data)

    # format(49, 'x') returns '31', which is '\x31' but without the '\x' notation.
    # Basically the same as hex(49), but meant for this kind of thing.
    return bytes(format(len(compressed), 'x'), 'UTF-8') + b'\r\n' + compressed + b'\r\n0\r\n\r\n'
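As a sanity check, here is a minimal round-trip sketch (the function is repeated so this runs standalone): the emitted framing can be dechunked and gunzipped back to the original payload.

```python
import gzip

def http_gzip(data):
    compressed = gzip.compress(data)
    # Single chunk: hex length, CRLF, payload, CRLF, then the terminating 0-chunk.
    return bytes(format(len(compressed), 'x'), 'UTF-8') + b'\r\n' + compressed + b'\r\n0\r\n\r\n'

body = http_gzip(b'Hello, world!')

# Parse the single chunk back out of the framing:
size_end = body.find(b'\r\n')
chunk_len = int(body[:size_end], 16)
chunk = body[size_end + 2:size_end + 2 + chunk_len]
assert gzip.decompress(chunk) == b'Hello, world!'
```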

Upvotes: 1

Mark Adler

Reputation: 112384

You can't split on \r\n, since the compressed data may contain that sequence, and if long enough, certainly will. You need to dechunk first using the provided lengths (e.g. the first length, 1f50) and feed the resulting chunks to the decompressor. The compressed data starts with the \x1f\x8b.

The chunking is hex number, crlf, chunk with that many bytes, crlf, hex number, crlf, chunk, crlf, ..., last chunk (of zero length), [possibly some headers], crlf.
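The steps above can be sketched in a single pass. This is an illustration, assuming raw holds the body bytes that follow the blank line after the headers, with no chunk extensions or trailer headers; each chunk is fed straight into an incremental gzip decompressor rather than stitched first:

```python
import gzip
import zlib

def dechunk_and_gunzip(raw):
    """Walk the chunked framing and feed each chunk to an incremental
    decompressor (wbits=16 + MAX_WBITS selects the gzip wrapper)."""
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out = b''
    pos = 0
    while True:
        eol = raw.find(b'\r\n', pos)
        chunk_len = int(raw[pos:eol], 16)  # ASCII hex length field
        if chunk_len == 0:
            break                          # zero-length chunk ends the body
        start = eol + 2
        out += d.decompress(raw[start:start + chunk_len])
        pos = start + chunk_len + 2        # skip the chunk's trailing CRLF
    return out + d.flush()

# Build a two-chunk gzip body and round-trip it:
payload = b'abc' * 5000
comp = gzip.compress(payload)
half = len(comp) // 2
body = (bytes(format(half, 'x'), 'ascii') + b'\r\n' + comp[:half] + b'\r\n'
        + bytes(format(len(comp) - half, 'x'), 'ascii') + b'\r\n' + comp[half:] + b'\r\n'
        + b'0\r\n\r\n')
assert dechunk_and_gunzip(body) == payload
```

Because a gzip stream is just a byte stream, the chunk boundaries can fall anywhere inside it; the incremental decompressobj handles that, whereas decompressing individual chunks separately cannot.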

Upvotes: 3
