Reputation: 57
Currently, I'm working on receiving a TCP stream and analysing HTTP data in Python. I have already learned how to decode chunked data here. My problem is: when I hold the whole HTTP response and start to decode it, the prefixed chunk size is quite a bit smaller than the actual size. I'll show below:
This is pure data I've received:
b'000096F6\r\n<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" prefix="og: http://opengraphprotocol.org/schema/ fb: http://www.facebook.com/2010/fbml d: http://dictionary.com/2011/dml">\n<head>\n<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>\n<base href="http://dictionary.reference.com/">\n<title>Search | Define Search at Dictionary.com</title>\n<script.....(more data)
You can see the prefixed size is (hex) 96F6 = 38646 (bytes)
But if I split data by this algorithm:
encoded = row_data
new_data = ""
while encoded != '':
    off = int(encoded[:encoded.index('\r\n')], 16)
    if off == 0:
        break
    encoded = encoded[encoded.index('\r\n') + 2:]
    new_data += encoded[:off]
    encoded = encoded[off + 2:]
return new_data
I only obtain two damaged fragments:
(more data).....<div class="dot dot-left dot-bottom "></
and
v>
<div class="language-name oneClick-disabled">.....(more data)
So it threw an exception because it could not parse off on the next loop iteration. As I carefully inspected the response body, I found that len(data) is 78543 but len(data.decode()) is 78503, and the whole response has only one chunk!
Then I tried lots of web sites and they all have this problem.
So, my question is: what am I doing wrong? How do I correctly decode this type of data? Thanks to anyone who can help!
Upvotes: 0
Views: 163
Reputation: 31087
Your sample code works well for me with a response from https://www.facebook.com/. For an easier-to-reproduce case, try the example from the Wikipedia article:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
e\r\n
 in\r\n\r\nchunks.\r\n
0\r\n
\r\n
Or, as a Python string:
encoded = '4\r\nWiki\r\n5\r\npedia\r\ne\r\n in\r\n\r\nchunks.\r\n0\r\n\r\n'
With your code, this gives:
Wikipedia in
chunks.
as expected.
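For reference, here is your loop wrapped into a self-contained function (the function name and wrapper are mine) and run against that string:

```python
def decode_chunked(encoded):
    # Decode an HTTP chunked body (headers already stripped off).
    new_data = ""
    while encoded != '':
        # The chunk header is a hex length terminated by CRLF.
        off = int(encoded[:encoded.index('\r\n')], 16)
        if off == 0:  # a zero-length chunk terminates the body
            break
        encoded = encoded[encoded.index('\r\n') + 2:]
        new_data += encoded[:off]
        encoded = encoded[off + 2:]  # skip the chunk's trailing CRLF
    return new_data

encoded = '4\r\nWiki\r\n5\r\npedia\r\ne\r\n in\r\n\r\nchunks.\r\n0\r\n\r\n'
print(decode_chunked(encoded))  # Wikipedia in\r\n\r\nchunks.
```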
The two most likely errors elsewhere in your program are encoding or networking. Note that the chunk lengths are specified in bytes. If you've decoded or re-encoded row_data at any point then you may not have the original data. Alternatively, make sure that you're concatenating the data read from the socket correctly without introducing any spurious spaces or newlines.
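As a quick illustration of the bytes-vs-characters trap (using a made-up string, not your actual response): chunk sizes count bytes on the wire, but slicing a decoded str counts characters, so any multi-byte UTF-8 character shifts every later offset. Keeping the loop in bytes avoids the drift entirely.

```python
# One multi-byte character makes the byte count and character
# count disagree -- exactly the kind of mismatch you observed.
raw = 'caf\u00e9'.encode('utf-8')   # b'caf\xc3\xa9'
print(len(raw))                     # 5 bytes on the wire
print(len(raw.decode('utf-8')))     # 4 characters after decoding

# int() happily parses the hex prefix straight from bytes,
# so the whole loop can stay in bytes until the very end.
print(int(b'96F6', 16))             # 38646
```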
Upvotes: 1