dagnelies

Reputation: 5319

Python: how to fetch a URL? (with improper response headers)

I want to build a small script in Python which needs to fetch a URL. The server is kind of crappy, though, and replies with pure ASCII without any headers.

When I try:

import urllib.request
response = urllib.request.urlopen(url)
print(response.read())

I get an http.client.BadStatusLine: 100 error because this isn't a properly formatted HTTP response.

Is there another way to fetch a URL and get the raw content, without trying to parse the response?

Thanks

Upvotes: 0

Views: 3430

Answers (3)

Marty

Reputation: 8260

It's difficult to answer your direct question without a bit more information, since we don't know exactly how the (web) server in question is broken.

That said, you might try using something a bit lower-level, a socket for example. Here's one way (Python 2.x style, and untested):

#!/usr/bin/env python
import socket
from urlparse import urlparse

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    sock = socket.create_connection((host, int(port)), timeout)
    # An empty path must be requested as '/'; HTTP line endings are CRLF.
    sock.sendall('GET %s HTTP/1.0\r\n\r\n' % (parsed.path or '/'))

    # Read until the server closes the connection (recv returns '').
    response = [sock.recv(receive_buffer)]
    while response[-1]:
        response.append(sock.recv(receive_buffer))
    sock.close()

    return ''.join(response)

print geturl('http://www.example.com/')

And here's a stab at a Python 3.2 conversion (you may not need to decode from bytes if, for example, you are writing the response to a file):

#!/usr/bin/env python
import socket
from urllib.parse import urlparse

ENCODING = 'ascii'

def geturl(url, timeout=10, receive_buffer=4096):
    parsed = urlparse(url)
    try:
        host, port = parsed.netloc.split(':')
    except ValueError:
        host, port = parsed.netloc, 80

    # An empty path must be requested as '/'; HTTP line endings are CRLF.
    request = 'GET %s HTTP/1.0\r\n\r\n' % (parsed.path or '/')

    with socket.create_connection((host, int(port)), timeout) as sock:
        sock.sendall(request.encode(ENCODING))

        # Read until the server closes the connection (recv returns b'').
        response = [sock.recv(receive_buffer)]
        while response[-1]:
            response.append(sock.recv(receive_buffer))

    return b''.join(response).decode(ENCODING)

print(geturl('http://www.example.com/'))

HTH!

Edit: You may need to adjust what you put in the request, depending on the web server in question. Guanidene's excellent answer provides several resources to guide you on that path.
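As a rough illustration of what "adjusting the request" could look like, here is a sketch that builds the request bytes separately from the socket code. The helper name build_get_request and its extra_headers parameter are made up for this example. It assumes an HTTP/1.0 request with a Host header added, which HTTP/1.0 does not require but many servers expect.

```python
from urllib.parse import urlsplit


def build_get_request(url, extra_headers=None):
    """Return the raw bytes of a minimal HTTP/1.0 GET request for url.

    Hypothetical helper for illustration, not part of any library.
    """
    parts = urlsplit(url)

    # An empty path must be sent as "/"; keep the query string if present.
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query

    lines = ['GET %s HTTP/1.0' % target, 'Host: %s' % parts.netloc]
    for name, value in (extra_headers or {}).items():
        lines.append('%s: %s' % (name, value))

    # Each line ends with CRLF; an empty line terminates the request.
    return ('\r\n'.join(lines) + '\r\n\r\n').encode('ascii')
```

The returned bytes can be passed straight to sock.sendall(), so trying a different header against a picky server only means changing the arguments.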

Upvotes: 1

Pushpak Dagade

Reputation: 6450

What you need to do in this case is send a raw HTTP request using sockets.
You would need to do a bit of low-level network programming using Python's socket module. (Network sockets return all the information sent by the server as it is, so you can interpret the response however you wish. For example, an HTTP client interprets the response as a status line followed by standard HTTP headers and then the body. The high-level urllib module hides this header information from you and just returns the data.)
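To make that concrete, here is a small sketch of interpreting the raw bytes yourself once they have been read from a socket. The function name split_raw_response is invented for this example, and the rule it uses (anything not starting with "HTTP/" is treated as a header-less body) is an assumption about how the broken server behaves.

```python
def split_raw_response(raw):
    """Split raw bytes into (status_line, headers, body).

    Hypothetical helper: if the data does not start with "HTTP/", it is
    assumed to be a header-less reply and is returned entirely as the body.
    """
    if not raw.startswith(b'HTTP/'):
        return None, {}, raw

    # Headers and body are separated by an empty line.
    head, sep, body = raw.partition(b'\r\n\r\n')
    if not sep:
        # Some servers use bare LF line endings instead of CRLF.
        head, sep, body = raw.partition(b'\n\n')

    lines = head.decode('latin-1').splitlines()
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(':')
        if colon:
            headers[name.strip().lower()] = value.strip()
    return status_line, headers, body
```

With a reply like the one in the question, the first branch is taken and the whole payload comes back untouched as the body.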

You also need some basic knowledge of HTTP requests. For your case, you just need to know about the HTTP GET request. See its definition here - http://djce.org.uk/dumprequest, and see an example of it here - http://en.wikipedia.org/wiki/HTTP#Example_session. (If you wish to capture live traces of the HTTP requests sent from your browser, you will need packet-sniffing software such as Wireshark.)

Once you know the basics of the socket module and HTTP headers, you can go through this example - http://coding.debuntu.org/python-socket-simple-tcp-client which shows how to send an HTTP request over a socket to a server and read its reply. You can also refer to this unclear question on SO.
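For something that can be tried without touching a real server, here is a sketch that starts a throwaway server on localhost which replies with plain ASCII and no status line or headers, then fetches from it over a raw socket. All function names (serve_once, fetch_raw, demo) and the payload are made up for this example, and the real server may of course misbehave differently.

```python
import socket
import threading


def serve_once(listener, payload):
    """Accept one connection, read the request, reply with the bare payload."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)        # read (and ignore) the request
        conn.sendall(payload)  # no status line, no headers


def fetch_raw(host, port, path='/', timeout=10):
    """Send a minimal GET and return every byte the server sends back."""
    request = 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' % (path, host)
    chunks = []
    with socket.create_connection((host, port), timeout) as sock:
        sock.sendall(request.encode('ascii'))
        while True:
            chunk = sock.recv(4096)
            if not chunk:      # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return b''.join(chunks)


def demo():
    """Run the fake server in a thread and fetch its reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(('127.0.0.1', 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once,
                              args=(listener, b'100 pure ascii reply\n'))
    worker.start()
    try:
        return fetch_raw('127.0.0.1', port)
    finally:
        worker.join()
        listener.close()
```

Calling demo() returns the payload exactly as the fake server sent it, since nothing on the client side tries to parse a status line.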

(You can google python socket http to get more examples.)

(Tip: I am not a Java fan, but if you don't find enough convincing examples on this topic in Python, try looking for Java examples and then translate them to Python.)

Upvotes: 1

user850498

Reputation: 727

import urllib.request
urllib.request.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

Upvotes: 0
