Decode content from httplib GET

Question

I'm fetching a plain-text HTTP response that is encoded in CP-1250 (I can't influence that) and would like to decode it, process it line by line and eventually save it as UTF-8.

The first part is causing me problems. After I get the raw data using response.read(), I pass it to a reader created by getreader("cp1250") from the codecs library. I expect to get a StreamReader instance and simply call readlines to get a list of decoded lines.

import codecs
import httplib

# nothing unusual
conn = httplib.HTTPConnection('server')
conn.request('GET', '/')
response = conn.getresponse()
content = response.read()

# the painful part
sr = codecs.getreader("cp1250")(content)
lines = sr.readlines()      # d'oh!

But after the call to readlines I only get yells echoing from somewhere deep inside codecs:

[...snip...]
  File "./download", line 123, in _parse
    lines = sr.readlines()
  File "/usr/lib/python2.7/codecs.py", line 588, in readlines
    data = self.read()
  File "/usr/lib/python2.7/codecs.py", line 471, in read
    newdata = self.stream.read()
AttributeError: 'str' object has no attribute 'read'

My prints confirm that sr is an instance of StreamReader; it confuses me that the object seemed to initialize fine but now fails to execute readlines. What is missing here?

Or is the library trying to cryptically tell me that the data is corrupted (not CP-1250)?

Edit: As jorispilot suggests, unicode(content, encoding="cp1250") works, so I'll probably stick with that for my solution. However, I'd still like to know what was wrong with my usage of the codecs library.

Simon Callan · Accepted Answer

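According to http://docs.python.org/2/library/codecs.html, getreader() returns a StreamReader class. That class must be instantiated with a stream, that is, an object implementing read(), not a string as you are doing. The constructor only stores whatever it is given and does not call read() until data is requested, which is why creating sr succeeded and the error only surfaced in readlines(). It is not a sign that the data is corrupted.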

To fix this, don't call response.read() yourself; pass the response object itself to the StreamReader, as below.

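import codecs
import httplib

conn = httplib.HTTPConnection('server')
conn.request('GET', '/')
response = conn.getresponse()

# response is file-like (it has read()), so the StreamReader can pull bytes from it
reader = codecs.getreader("cp1250")(response)
lines = reader.readlines()    # list of decoded (unicode) lines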
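
If the body has already been read into a string, as in the question, a StreamReader can still be used by wrapping the bytes in a file-like object first. The following is only a minimal sketch: the sample text stands in for the result of response.read() and is made up, and io.BytesIO is my choice of wrapper, not something the question used. It also shows that decoding in one step and splitting gives the same lines.

```python
import codecs
import io

# Stand-in for the bytes returned by response.read(); the sample text is invented.
content = u"\u017elu\u0165ou\u010dk\u00fd k\u016f\u0148\nsecond line\n".encode("cp1250")

# Option 1: wrap the bytes so the StreamReader has a read() method to call.
reader = codecs.getreader("cp1250")(io.BytesIO(content))
lines_from_reader = reader.readlines()

# Option 2: decode in one step, then split while keeping the line endings.
lines_from_decode = content.decode("cp1250").splitlines(True)

# Either list can be joined and re-encoded when saving as UTF-8.
utf8_data = u"".join(lines_from_reader).encode("utf-8")
```

For a response that fits comfortably in memory, the second option is probably the simpler of the two; the reader is mainly useful when the data comes from a stream.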
