Spaceship222

Reputation: 849

Extract gzip content from raw http response

I'm trying to do a plain HTTP GET (not HTTPS, i.e. the URL is http://www.example.com) using only the socket module. I recv the response, which contains all the data transferred from the server (headers plus a gzip-encoded body), and now I want to extract the gzipped body. I guess this content should start at \x1f\x8b\x08, but I don't know where it ends. Any help?

Below is my raw response

HTTP/1.1 200 OK\r\n
Header Part\r\n
\r\n
some_number_here\r\n
\x1f\x8b\x08 ......
......\r\n
0\r\n
\r\n

Upvotes: 2

Views: 548

Answers (1)

regilero

Reputation: 30496

I bet that in the header part you have a Transfer-Encoding: chunked header.

This is an HTTP/1.1 response, not an HTTP/1.0 one, and every HTTP/1.1 client is required to understand chunked transfer encoding.

You have two solutions:

  • tell the server you do not understand HTTP/1.1 by using HTTP/1.0 on the first line of your requests, as in GET /foo HTTP/1.0
  • implement chunked transfer parsing.
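For the first option, here is a minimal sketch of what the request could look like. The helper names (build_http10_request, fetch_http10) are made up for illustration, and the Accept-Encoding: gzip header is an assumption about what the original request sent. A server answering an HTTP/1.0 request should not use chunked encoding, so the body normally runs until the server closes the connection.

```python
import socket


def build_http10_request(host: str, path: str = "/") -> bytes:
    """Build a GET request that announces HTTP/1.0 on the request line."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "Accept-Encoding: gzip",  # assumption: ask for a gzip-encoded body
        "",                       # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_http10(host: str, path: str = "/", port: int = 80) -> bytes:
    """Send the request and read until the server closes the connection."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_http10_request(host, path))
        received = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the peer closed, response is complete
                break
            received.append(data)
    return b"".join(received)
```

With this, everything after the first blank line (\r\n\r\n) of the returned bytes would be the gzip body itself, with no chunk sizes mixed in.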

The parsing is not so hard. Instead of a raw body you have a body split into parts (chunks). Each part starts with the chunk size (the some_number_here\r\n line), which is a hexadecimal number (careful: 10 means 16, 1c means 28).

Then you have the raw chunk content, exactly that many bytes, followed by \r\n.

Then the next chunk.

Until you reach the last chunk, which is advertised with a size of 0 (0\r\n\r\n).
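The steps above can be sketched roughly as follows. This is only an illustration, assuming the whole response is already in memory, that the body really is chunked, and that it is gzip-encoded; decode_chunked and extract_gzip_body are hypothetical names. Note that the chunk data is cut by its size, never by searching for \r\n, because compressed bytes can contain \r\n themselves.

```python
import gzip


def decode_chunked(body: bytes) -> bytes:
    """Join the chunks of a complete chunked body into the raw payload."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)  # end of the chunk-size line
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size = int(body[pos:eol].split(b";", 1)[0], 16)
        if size == 0:  # zero-size chunk marks the end of the body
            break
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the \r\n that follows the chunk data
    return bytes(out)


def extract_gzip_body(raw_response: bytes) -> bytes:
    """Split headers from body, un-chunk the body, then gunzip it."""
    _headers, _sep, body = raw_response.partition(b"\r\n\r\n")
    return gzip.decompress(decode_chunked(body))
```

Applied to the raw response in the question, extract_gzip_body(raw) would return the decompressed page, provided the headers do say Transfer-Encoding: chunked and Content-Encoding: gzip.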

Warning: the server may pause between chunks, so you have to keep reading from the socket until you have seen this last chunk.

PS: do not implement HTTP on raw sockets for anything that will go into production later. There are plenty of HTTP clients available, including in Python, and it is a huge job to get something secure and robust.
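One way to "keep reading" is to re-check after every recv whether the terminating chunk has arrived. The sketch below does that by walking the chunk sizes; the function names are hypothetical, and it rescans the buffer on each read, which is simple but not efficient for large responses.

```python
import socket


def chunked_body_complete(body: bytes) -> bool:
    """True once the zero-size chunk and the final blank line are present."""
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:  # chunk-size line not fully received yet
            return False
        size = int(body[pos:eol].split(b";", 1)[0], 16)
        if size == 0:
            # The body ends with a blank line, possibly after trailer fields.
            return body.find(b"\r\n\r\n", eol) != -1
        pos = eol + 2 + size + 2  # size line, chunk data, trailing \r\n
        if pos > len(body):  # chunk data not fully received yet
            return False


def recv_chunked_response(sock: socket.socket, bufsize: int = 4096) -> bytes:
    """Read from the socket until the last chunk is seen or the peer closes."""
    raw = b""
    while True:
        data = sock.recv(bufsize)
        if not data:  # connection closed before the last chunk
            break
        raw += data
        _head, sep, body = raw.partition(b"\r\n\r\n")
        if sep and chunked_body_complete(body):
            break
    return raw
```

Checking only that the buffer ends with 0\r\n\r\n would be shorter, but chunk data could happen to end with the same bytes, so following the sizes is the safer test.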

Upvotes: 1
