Alois Mahdal

Reputation: 11253

Decode content from httplib GET

I'm fetching a simple plain-text document over HTTP that is encoded in CP-1250 (I can't influence that). I would like to decode it, process it line by line and eventually save it as UTF-8.

The first part is causing me problems. After I get the raw data using response.read(), I pass it to a reader created by getreader("cp1250") from the codecs library. I expect to get a StreamReader instance and simply call readlines to get a list of decoded lines.

import codecs
import httplib

# nothing unusual
conn = httplib.HTTPConnection('server')
conn.request('GET', '/')
response = conn.getresponse()
content = response.read()

# the painful part
sr = codecs.getreader("cp1250")(content)
lines = sr.readlines()      # d'oh!

But after the call to readlines I only get yells echoing from somewhere deep inside codecs:

[...snip...]
  File "./download", line 123, in _parse
    lines = sr.readlines()
  File "/usr/lib/python2.7/codecs.py", line 588, in readlines
    data = self.read()
  File "/usr/lib/python2.7/codecs.py", line 471, in read
    newdata = self.stream.read()
AttributeError: 'str' object has no attribute 'read'

My prints confirm that sr is an instance of StreamReader. It confuses me that the object seemed to initialize fine but now fails to execute readlines. What is missing here?

Or is the library trying to cryptically tell me that the data is corrupted (not CP-1250)?

Edit: As jorispilot suggests, unicode(content, encoding="cp1250") works, so I'll probably stick with that for my solution. However, I'd still like to know what was wrong with my usage of the codecs library.

Upvotes: 0

Views: 1777

Answers (2)

Simon Callan

Reputation: 3130

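According to http://docs.python.org/2/library/codecs.html, getreader() returns a StreamReader class. Its constructor must be passed a stream, that is, an object that implements read(), not a string as in your code. The constructor does not check this, which is why the error only shows up once readlines() tries to call read() on the string.

To fix this, don't read the data from response yourself; pass the response object directly to the StreamReader, as below.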

import codecs
import httplib

conn = httplib.HTTPConnection('server')
conn.request('GET', '/')
response = conn.getresponse()

# response has a read() method, so it can act as the stream
reader = codecs.getreader("cp1250")(response)
lines = reader.readlines()
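
If the body has already been read into a string, as in the question, the same idea can probably be kept by wrapping the bytes in an in-memory stream first. The following is only a sketch: the sample bytes are made up and stand in for response.read(), and io.BytesIO is my choice of wrapper, not something from the question.

```python
import codecs
import io

# Hypothetical stand-in for response.read(): two lines of CP-1250 encoded text.
content = u"\u017elu\u0165ou\u010dk\u00fd k\u016f\u0148\np\u0159\u00edli\u0161\n".encode("cp1250")

# StreamReader calls read() on whatever it is given, so wrap the raw bytes
# in a file-like object instead of passing the string itself.
reader = codecs.getreader("cp1250")(io.BytesIO(content))

# Each element is a decoded text line; line endings are kept by default.
lines = reader.readlines()
```

This should behave the same in Python 2.7 and Python 3, since only codecs and io are involved.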

Upvotes: 1

Fabian

Reputation: 4348

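# content is the raw CP-1250 string returned by response.read()
utf8_lines = []
for line in content.split('\n'):
    # strip surrounding whitespace, decode from CP-1250, re-encode as UTF-8
    line = line.strip().decode('cp1250')
    utf8_lines.append(line.encode('utf-8'))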
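
A possible variant, as a sketch only: decode the whole body once and split afterwards, so that splitting and stripping operate on text rather than on raw bytes, then let the file object do the UTF-8 encoding. The sample bytes and the output path below are hypothetical and not part of the question.

```python
import io
import os
import tempfile

# Hypothetical stand-in for response.read(): CP-1250 bytes with CRLF line endings.
content = u"p\u0159\u00edli\u0161 \r\n\u017elu\u0165ou\u010dk\u00fd\r\n".encode("cp1250")

# Decode once, then split into lines and strip surrounding whitespace.
text = content.decode("cp1250")
lines = [line.strip() for line in text.splitlines()]

# Write the lines out as UTF-8; io.open encodes on the way to disk.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")  # hypothetical location
with io.open(out_path, "w", encoding="utf-8") as out:
    for line in lines:
        out.write(line + u"\n")
```

Using splitlines() instead of split('\n') also handles the '\r\n' line endings that HTTP servers often send.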

Upvotes: 2
