Reputation: 5319
I want to build a small script in Python that needs to fetch a URL. The server is rather crappy, though, and replies with pure ASCII and no headers.
When I try:
import urllib.request
response = urllib.request.urlopen(url)
print(response.read())
I get an http.client.BadStatusLine: 100 error, because this isn't a properly formatted HTTP response.
Is there another way to fetch a URL and get the raw content, without trying to parse the response?
Thanks
Upvotes: 0
Views: 3430
Reputation: 8260
It's difficult to answer your question directly without knowing exactly how the (web) server in question is broken.
That said, you might try using something a bit lower-level, a socket for example. Here's one way (Python 2.x style, and untested):
    #!/usr/bin/env python
    import socket
    from urlparse import urlparse

    def geturl(url, timeout=10, receive_buffer=4096):
        parsed = urlparse(url)
        try:
            host, port = parsed.netloc.split(':')
            port = int(port)
        except ValueError:
            # No explicit port in the URL; fall back to the HTTP default.
            host, port = parsed.netloc, 80
        sock = socket.create_connection((host, port), timeout)
        try:
            # HTTP lines end in CRLF; an empty path is sent as '/'.
            sock.sendall('GET %s HTTP/1.0\r\n\r\n' % (parsed.path or '/'))
            # Keep reading until the server closes the connection (recv returns '').
            response = [sock.recv(receive_buffer)]
            while response[-1]:
                response.append(sock.recv(receive_buffer))
        finally:
            sock.close()
        return ''.join(response)

    print geturl('http://www.example.com/')
And here's a stab at a python3.2 conversion (you may not need to decode from bytes, if writing the response to a file for example):
    #!/usr/bin/env python3
    import socket
    from urllib.parse import urlparse

    ENCODING = 'ascii'

    def geturl(url, timeout=10, receive_buffer=4096):
        parsed = urlparse(url)
        try:
            host, port = parsed.netloc.split(':')
            port = int(port)
        except ValueError:
            # No explicit port in the URL; fall back to the HTTP default.
            host, port = parsed.netloc, 80
        # HTTP lines end in CRLF; an empty path is sent as '/'.
        request = 'GET %s HTTP/1.0\r\n\r\n' % (parsed.path or '/')
        with socket.create_connection((host, port), timeout) as sock:
            sock.sendall(request.encode(ENCODING))
            # Keep reading until the server closes the connection (recv returns b'').
            response = [sock.recv(receive_buffer)]
            while response[-1]:
                response.append(sock.recv(receive_buffer))
        # Join the raw bytes first, then decode once.
        return b''.join(response).decode(ENCODING)

    print(geturl('http://www.example.com/'))
HTH!
Edit: You may need to adjust what you put in the request, depending on the web server in question. Guanidene's excellent answer provides several resources to guide you on that path.
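To make "adjust what you put in the request" a bit more concrete, here is a minimal sketch of a request builder. The helper name build_request and its parameters are made up for illustration; which headers (if any) your server actually wants is something you would have to find out by experiment.

```python
from urllib.parse import urlsplit


def build_request(url, method="GET", version="HTTP/1.0", headers=None):
    """Return raw request bytes for `url` (hypothetical helper, not a library API)."""
    parts = urlsplit(url)
    # The request target is the path plus the query string; an empty path becomes "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = ["{} {} {}".format(method, target, version)]
    # Send a Host header by default; the caller can add to or override it.
    all_headers = {"Host": parts.netloc}
    all_headers.update(headers or {})
    lines.extend("{}: {}".format(name, value) for name, value in all_headers.items())
    # Each line ends with CRLF, and a blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


request = build_request("http://www.example.com/data?id=1",
                        headers={"Connection": "close"})
print(request)
```

The resulting bytes can be passed straight to sock.sendall() in place of the hard-coded GET line, so you can try different variations against the server without touching the rest of the function.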
Upvotes: 1
Reputation: 6450
What you need to do in this case is send a raw HTTP request using sockets.
You would need to do a bit of low-level network programming using Python's socket module. (Network sockets return all the information sent by the server as is, so you can interpret the response however you wish. For example, the HTTP protocol expects a response to begin with a status line, followed by headers and then the body. The high-level module urllib handles this header information for you and just returns the data.)
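As a rough illustration of interpreting the raw bytes yourself, here is a small sketch. The function name split_response is made up, and the rule "anything not starting with HTTP/ is treated as pure body" is an assumption that seems to fit the server described in the question, not something I have verified against it.

```python
def split_response(raw):
    """Split raw bytes into (status_line, headers, body).

    Hypothetical helper: if the data does not start with "HTTP/",
    assume the server sent no status line or headers at all.
    """
    if not raw.startswith(b"HTTP/"):
        return None, {}, raw
    # A blank line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


# A well-formed response and a bare one like the server in the question sends.
print(split_response(b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"))
print(split_response(b"100\n"))
```

With the bare reply, everything the server sent ends up in the body, which is the "raw content" the question asks for.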
You also need some basic knowledge of HTTP requests. For your case, you just need to know about the GET request method. See its definition here - http://djce.org.uk/dumprequest, and an example of it here - http://en.wikipedia.org/wiki/HTTP#Example_session. (If you wish to capture live traces of HTTP requests sent from your browser, you would need packet-sniffing software such as Wireshark.)
Once you know the basics of the socket module and HTTP headers, you can go through this example - http://coding.debuntu.org/python-socket-simple-tcp-client which shows how to send an HTTP request over a socket to a server and read its reply. You can also refer to this unclear question on SO.
(You can google python socket http to get more examples.)
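If you want to experiment without hitting the real server, one option is a throwaway local server that misbehaves the same way. The sketch below is only that: serve_once and fetch_raw are made-up names, and the b"100\n" payload is a guess at what the server in the question sends, based on the BadStatusLine: 100 error.

```python
import socket
import threading


def serve_once(listener, payload):
    """Accept one connection and reply with raw bytes, no status line or headers."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)        # read and ignore the request
        conn.sendall(payload)  # closing the connection marks the end of the reply


def fetch_raw(host, port, path="/", timeout=5):
    """Send a GET request over a plain socket and return everything received."""
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    chunks = []
    with socket.create_connection((host, port), timeout) as sock:
        sock.sendall(request.encode("ascii"))
        while True:
            chunk = sock.recv(4096)
            if not chunk:      # empty bytes means the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Bind to port 0 so the OS picks a free port on localhost.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

server = threading.Thread(target=serve_once, args=(listener, b"100\n"))
server.start()
result = fetch_raw("127.0.0.1", port)
server.join()
listener.close()

print(result)
```

Because nothing here tries to parse a status line, the header-less reply comes back untouched instead of raising an error.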
(Tip: I am not a Java fan, but still, if you don't find enough convincing examples on this topic under python, try finding it under Java, and then accordingly translate it to python.)
Upvotes: 1