Reputation: 23480
How do you manage chunked data with gzip encoding? I have a server which sends data in the following manner:
HTTP/1.1 200 OK\r\n
...
Transfer-Encoding: chunked\r\n
Content-Encoding: gzip\r\n
\r\n
1f50\r\n\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec}\xebr\xdb\xb8\xd2\xe0\xef\xb8\xea\xbc\x03\xa2\xcc\x17\xd9\xc7\xba\xfa\x1e\xc9r*\x93\xcbL\xf6\xcc\x9c\xcc7\xf1\x9c\xf9\xb6r\xb2.H ... L\x9aFs\xe7d\xe3\xff\x01\x00\x00\xff\xff\x03\x00H\x9c\xf6\xe93\x00\x01\x00\r\n0\r\n\r\n
I've had a few different approaches to this but there's something i'm forgetting here.
data = b''
depleted = False
while not depleted:
    depleted = True
    for fd, event in poller.poll(2.0):
        depleted = False
        if event == select.EPOLLIN:
            tmp = sock.recv(8192)
            data += zlib.decompress(tmp, 15 + 32)
This gives (I also tried decoding only the data after \r\n\r\n, obviously):
zlib.error: Error -3 while decompressing data: incorrect header check
So I figured the data should be decompressed only once it has been received in its entirety...
...
if event == select.EPOLLIN:
    data += sock.recv(8192)
data = zlib.decompress(data.split(b'\r\n\r\n', 1)[1], 15 + 32)
Same error. I also tried decompressing data[:-7] (because of the chunk ID at the very end of the data), as well as data[2:-7] and various other combinations, but always with the same error.
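As an aside on the per-recv() approach: calling zlib.decompress on each fragment can't work, because only the first fragment carries the gzip header. zlib can decompress incrementally if a single decompressobj is reused across reads, though even then the chunk framing has to be stripped out first. A minimal sketch, using gzip.compress to stand in for an already-dechunked body:

```python
import zlib
import gzip

payload = b'Hello, chunked world! ' * 100
stream = gzip.compress(payload)  # stands in for the dechunked response body

d = zlib.decompressobj(15 + 32)  # 15+32 auto-detects the gzip header
out = b''
for i in range(0, len(stream), 512):  # feed arbitrary-sized fragments
    out += d.decompress(stream[i:i + 512])
out += d.flush()
assert out == payload
```

The decompressobj keeps its state between calls, so fragment boundaries no longer matter.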
I've also tried the gzip module via:
with gzip.GzipFile(fileobj=BytesIO(data), mode='rb') as fh:
    fh.read()
But that gives me "Not a gzipped file".
Even after recording the data as received from the server (headers + data) into a file, then creating a server socket on port 80 serving that data (again, as-is) to the browser, it renders perfectly, so the data is intact.
I took this data, stripped off the headers (and nothing else), and tried gzip on the file.
Thanks to @mark-adler I produced the following code to un-chunk the chunked data:
unchunked = b''
pos = 0
while pos <= len(data):
    chunkLen = int(binascii.hexlify(data[pos:pos+2]), 16)
    unchunked += data[pos+2:pos+2+chunkLen]
    pos += 2 + len('\r\n') + chunkLen

with gzip.GzipFile(fileobj=BytesIO(data[:-7])) as fh:
    data = fh.read()
This produces OSError: CRC check failed 0x70a18ee9 != 0x5666e236, which is one step closer. In short, I clip the data according to these four parts:
<chunk length o' X bytes>
\r\n
<chunk>
\r\n
I'm probably getting there, but not close enough.
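For what it's worth, the chunk-size field is already ASCII hex digits, so running binascii.hexlify over those bytes re-encodes the digit characters themselves instead of parsing them. A quick illustration with the 1f50 size line from the capture above:

```python
import binascii

size_field = b'1f50'  # the chunk-size line, as received off the wire

# Parsing the ASCII hex digits directly gives the real chunk length.
print(int(size_field, 16))                    # 8016

# hexlify turns b'1f50' into b'31663530' (the hex codes of '1', 'f', '5', '0'),
# so int() then yields a huge, meaningless number.
print(binascii.hexlify(size_field))           # b'31663530'
print(int(binascii.hexlify(size_field), 16))  # 828781872
```

So int(size_field, 16) is all that's needed; hexlify would only be appropriate for raw binary bytes, not text that is already hex.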
Footnote: Yes, the socket handling is far from optimal, but it looks this way because I thought I wasn't getting all the data from the socket, so I implemented a huge timeout and an attempted fail-safe with depleted :)
Upvotes: 1
Views: 4704
Reputation: 23480
@mark-adler gave me some good pointers on how chunked mode in the HTTP protocol works; besides this, I fiddled around with different ways of unzipping the data.
(Use gzip, not zlib.) Here's the solution for all three of the above problems:
unchunked = b''
pos = 0
while pos <= len(data):
    chunkNumLen = data.find(b'\r\n', pos) - pos
    # print('Chunk length found between:', (pos, pos+chunkNumLen))
    chunkLen = int(data[pos:pos+chunkNumLen], 16)
    # print('This is the chunk length:', chunkLen)
    if chunkLen == 0:
        # print('The length was 0, we have reached the end of all chunks')
        break
    chunk = data[pos+chunkNumLen+len('\r\n'):pos+chunkNumLen+len('\r\n')+chunkLen]
    # print('This is the chunk (Skipping', pos+chunkNumLen+len('\r\n'), ', grabbing', len(chunk), 'bytes):', [data[pos+chunkNumLen+len('\r\n'):pos+chunkNumLen+len('\r\n')+chunkLen]], '...', [data[pos+chunkNumLen+len('\r\n')+chunkLen:pos+chunkNumLen+len('\r\n')+chunkLen+4]])
    unchunked += chunk
    pos += chunkNumLen + len('\r\n') + chunkLen + len('\r\n')

with gzip.GzipFile(fileobj=BytesIO(unchunked)) as fh:
    unzipped = fh.read()
return unzipped
I left the debug output in there (commented out) for a reason.
It was extremely useful, even though it looks like a mess, for seeing what data I was actually trying to decompress, which parts were fetched where, and which values each calculation brings forth.
This code will walk through the chunked data with the following format:
<chunk length o' X bytes>
\r\n
<chunk>
\r\n
I had to be careful when extracting the chunk length, which arrives as ASCII hex digits such as 1f50. At first I used binascii.hexlify(data[0:4]) before feeding the result to int(), which seemed to work once (giving a length of ~8000) but then suddenly produced a really big number. The reason is that hexlify re-encodes the ASCII digit bytes themselves instead of parsing them; the digits just need to be parsed directly with int(data[pos:pos+chunkNumLen], 16).
After that it was just a matter of making sure the numbers were correct and then combining all the chunks into one huge pile of gzip data and feeding that into GzipFile(...).
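To sanity-check the approach end to end, the same walk can be wrapped in a function (dechunk_gunzip is my own name, not part of any library) and exercised on a synthetic chunked body:

```python
import gzip
from io import BytesIO

def dechunk_gunzip(data):
    # Walk <hex length>\r\n<chunk>\r\n records until the zero-length chunk,
    # then gunzip the reassembled body.
    unchunked = b''
    pos = 0
    while pos <= len(data):
        chunkNumLen = data.find(b'\r\n', pos) - pos
        chunkLen = int(data[pos:pos+chunkNumLen], 16)
        if chunkLen == 0:
            break
        start = pos + chunkNumLen + 2          # skip the size line's \r\n
        unchunked += data[start:start+chunkLen]
        pos = start + chunkLen + 2             # skip the chunk's trailing \r\n
    with gzip.GzipFile(fileobj=BytesIO(unchunked)) as fh:
        return fh.read()

payload = b'spam and eggs ' * 300
compressed = gzip.compress(payload)
# Split into two chunks plus the zero-length terminator.
half = len(compressed) // 2
body = (format(half, 'x').encode() + b'\r\n' + compressed[:half] + b'\r\n'
        + format(len(compressed) - half, 'x').encode() + b'\r\n' + compressed[half:] + b'\r\n'
        + b'0\r\n\r\n')
assert dechunk_gunzip(body) == payload
```

The round trip confirms that the walk handles chunk boundaries that fall in the middle of the compressed stream.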
I'm aware that this was a client-side problem at first, but here's a server-side function to send a somewhat functional test:
def http_gzip(data):
    compressed = gzip.compress(data)
    # format(49, 'x') returns '31', i.e. the hex digits without the '0x' prefix.
    # Basically the same as hex(49), but meant for this kind of thing.
    return bytes(format(len(compressed), 'x'), 'UTF-8') + b'\r\n' + compressed + b'\r\n0\r\n\r\n'
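A quick round-trip check of that framing (re-stating the helper here so the snippet runs standalone):

```python
import gzip

def http_gzip(data):
    # Hex chunk length, CRLF, chunk, CRLF, then the terminating zero chunk.
    compressed = gzip.compress(data)
    return bytes(format(len(compressed), 'x'), 'UTF-8') + b'\r\n' + compressed + b'\r\n0\r\n\r\n'

body = http_gzip(b'test payload')
size_line, rest = body.split(b'\r\n', 1)
size = int(size_line, 16)
chunk, trailer = rest[:size], rest[size:]
assert trailer == b'\r\n0\r\n\r\n'
assert gzip.decompress(chunk) == b'test payload'
```

Note that splitting on the first \r\n is only safe here because we know it terminates the size line; inside the compressed bytes that sequence can occur freely, which is exactly why a real client must use the declared lengths.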
Upvotes: 1
Reputation: 112384
You can't split on \r\n, since the compressed data may contain, and if long enough certainly will contain, that sequence. You need to dechunk first using the lengths provided (e.g. the first length, 1f50) and feed the resulting chunks to the decompressor. The compressed data starts with \x1f\x8b.
The chunking is hex number, crlf, chunk with that many bytes, crlf, hex number, crlf, chunk, crlf, ..., last chunk (of zero length), [possibly some headers], crlf.
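That grammar can be sketched as a small parser (the names are mine, not a standard API); it stops at the zero-length chunk, tolerates chunk extensions after a semicolon on the size line, and ignores any trailer headers:

```python
def dechunk(body):
    # Grammar: hex-size CRLF chunk CRLF ... 0 CRLF [trailer headers] CRLF
    out = b''
    pos = 0
    while True:
        eol = body.index(b'\r\n', pos)
        # The size line may carry chunk extensions: "1f50;name=value"
        size = int(body[pos:eol].split(b';')[0], 16)
        if size == 0:
            break  # trailer headers (if any) follow; we ignore them
        out += body[eol+2:eol+2+size]
        pos = eol + 2 + size + 2
    return out

framed = b'5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n'
assert dechunk(framed) == b'hello world'
```

Only after this reassembly is it safe to hand the bytes to the gzip decoder.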
Upvotes: 3