robots.txt

Reputation: 137

Can't fetch desired content using socket

I'm trying to fetch the visible content from here using a socket, but I get an error when I run my script. I'm very new to socket programming, so I can't tell where I'm going wrong.

My code:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host_ip = socket.gethostbyname('data.pr4e.org')
s.connect((host_ip,80))
cmd = "GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n".encode()
s.send(cmd)

while True:
    data = s.recv(1024)
    if (len(data) <1 ):
        break
    print(data.decode())
s.close()

Error I'm getting:

400 Bad Request

Your browser sent a request that this server could not understand.

Upvotes: 1

Views: 71

Answers (2)

Kevin

Reputation: 30151

There are multiple problems here:

  1. It is uncommon to include http://data.pr4e.org after GET (see RFC 7230) unless talking to a proxy. You will usually write GET /romeo.txt and provide the hostname in a separate Host: data.pr4e.org header. HTTP/1.1 servers are required to accept the form you used, but some violate the standard and choke on it. This is especially likely if you claim to be using HTTP/1.0, which is stricter and forbids this form unless talking to a proxy.
  2. HTTP/1.0 is rarely used any more. All modern browsers and most other HTTP clients use HTTP/1.1 or HTTP/2. Some servers still support HTTP/1.0, but it's not mandatory. Note that HTTP/1.1 makes the Host: header mandatory, even when you put the full URL after GET.
  3. HTTP/1.0 uses \r\n ("CRLF") as a newline (see RFC 1945), so \n may not always be understood. Again, some servers will handle it correctly, but it is non-conforming. The use of CRLF has been carried over to HTTP/1.1.
  4. print(data.decode()) adds an extra newline after each chunk of data. This becomes a problem when a large HTTP response arrives in several pieces, so that recv() returns multiple nonempty chunks and a stray newline is printed between them. Use print(data.decode(), end='') instead.
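
Putting the four points together, a corrected request could look roughly like the sketch below. The helper names build_request and fetch are made up for illustration, and the Connection: close header is my own addition so that the server closes the socket and the recv() loop terminates. The sketch also assumes the server replies without chunked transfer encoding, which it does not try to decode.

```python
import socket


def build_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request with CRLF line endings."""
    lines = [
        f"GET {path} HTTP/1.1",  # path only, no scheme or hostname
        f"Host: {host}",         # mandatory in HTTP/1.1
        "Connection: close",     # assumption: ask the server to close when done
        "",                      # the two empty items produce the final
        "",                      # blank line that ends the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, path: str, port: int = 80) -> bytes:
    """Send the request and return the raw response (headers plus body)."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(build_request(host, path))
        while True:
            data = s.recv(1024)
            if not data:  # empty bytes: the server closed the connection
                break
            chunks.append(data)
    # Join before decoding so no stray newlines appear between chunks.
    return b"".join(chunks)
```

With this in place, print(fetch('data.pr4e.org', '/romeo.txt').decode(), end='') would print the response headers followed by the text.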

Upvotes: 1

Ajax1234

Reputation: 71451

I was able to obtain the desired result by ending the request with \r\n\r\n instead of the original \n\n:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((socket.gethostbyname('data.pr4e.org'), 80))
s.sendall("GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n".encode())
print(s.recv(1024))

Output:

...
Content-Type: text/plain\r\n\r\nBut soft what light through yonder window breaks\nIt is the east and Juliet is the sun\nArise fair sun and kill the envious moon\nWho is already sick and pale with grief\n'
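
Since the output above contains the headers as well as the text, here is a minimal sketch of how the body could be separated from them, assuming the whole response has already been read into one bytes object. The function name split_response and the sample response are made up for illustration; in particular, the status line in the sample is an assumption, because the real output is truncated.

```python
from typing import Tuple


def split_response(raw: bytes) -> Tuple[bytes, bytes]:
    """Split a raw HTTP response into (header block, body) at the first blank line."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    return head, body


# Hypothetical sample modelled on the output above; the status line is assumed.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"But soft what light through yonder window breaks\n"
)
head, body = split_response(sample)
print(body.decode(), end="")
```

Only the part after the first \r\n\r\n is the visible content; everything before it is the status line and headers.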

Upvotes: 1
