John

Reputation: 15286

Is it possible to peek at the data in a urllib2 response?

I need to detect the character encoding of HTTP responses. I check the headers first; if the charset is not set in the Content-Type header, I have to peek at the response body and look for a "<meta http-equiv='content-type'>" tag. I'd like to write a function that looks and works something like this:

response = urllib2.urlopen("http://www.example.com/")
encoding = detect_html_encoding(response)
...
page_text = response.read()

However, if I call response.read() inside my detect_html_encoding function, the subsequent response.read() after the call to my function will not work, because the body has already been consumed.

Is there an easy way to peek at the response and/or rewind after a read?

Upvotes: 1

Views: 518

Answers (2)

Alex Martelli

Reputation: 881735

def detectit(response):
    # try the headers etc. first, then, worst case:
    content = response.read()
    # cache the body so that later response.read() calls return it again
    response.read = lambda: content
    # now detect the encoding based on content

The trick, of course, is ensuring that response.read() WILL return the same thing again if needed. That's why we assign the lambda to it once we have had to extract the content: it ensures the same content can be extracted again (and again, and again, ... ;-).
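For illustration, here is a minimal sketch of that caching trick filled in with a very rough meta-tag check. The FakeResponse class, the META_CHARSET pattern, the 4096-byte window and the "utf-8" default are all assumptions made up for this example, not part of urllib2; a real response object would take the place of FakeResponse.

```python
import io
import re

# Hypothetical pattern: a loose search for charset=... near the top of the page.
META_CHARSET = re.compile(br"""charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


class FakeResponse(object):
    """Stand-in for a urllib2 response so the sketch runs without a network."""

    def __init__(self, body):
        self._stream = io.BytesIO(body)

    def read(self, *args):
        return self._stream.read(*args)


def detect_html_encoding(response, default="utf-8"):
    # Consume the body once...
    content = response.read()
    # ...then shadow read() on this instance so later calls return the same
    # bytes. Note that any size argument is ignored: the whole body comes back.
    response.read = lambda *args: content
    # Only scan the first few KB, where a meta tag would normally sit.
    match = META_CHARSET.search(content[:4096])
    return match.group(1).decode("ascii") if match else default


response = FakeResponse(
    b'<html><head><meta http-equiv="content-type" '
    b'content="text/html; charset=ISO-8859-1"></head></html>'
)
encoding = detect_html_encoding(response)
page_text = response.read()  # still returns the full body
```

The regex is deliberately crude; a real HTML parser would be more robust, but the point here is only that read() keeps working after the peek.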

Upvotes: 4

orip

Reputation: 75427

  1. If the encoding is in the HTTP headers (not the document itself), you can use response.info() to detect it.
  2. If you want to parse the HTML, save the response data and pass it to your function:

    page_text = response.read()
    encoding = detect_html_encoding(response, page_text)
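
For the header case in point 1, a rough sketch of pulling the charset out of a Content-Type value might look like the following. The function name charset_from_content_type and the CHARSET_PARAM pattern are hypothetical; the header string itself would come from the response headers (with urllib2, something like response.info().getheader("Content-Type"), as far as I know).

```python
import re

# Hypothetical pattern: matches charset=value, with or without quotes.
CHARSET_PARAM = re.compile(r"""charset\s*=\s*["']?([^\s;"']+)""", re.IGNORECASE)


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    if not header_value:
        return None
    match = CHARSET_PARAM.search(header_value)
    return match.group(1) if match else None


header_charset = charset_from_content_type("text/html; charset=UTF-8")
```

If this returns None, that is the cue to fall back to inspecting the saved page_text as in point 2.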
    

Upvotes: 0
