Reputation: 11253
I'm fetching a simple plain-text HTTP response that is encoded in CP-1250 (I can't influence that) and would like to decode it, process it line by line and eventually save it as UTF-8.
The first part is causing me problems. After I get the raw data using response.read(), I pass it to a reader created by getreader("cp1250") from the codecs library. I expect to get a StreamReader instance and simply call readlines to get a list of lines.
import codecs
import httplib
# nothing unusual
conn = httplib.HTTPConnection('server')
conn.request('GET', '/')
response = conn.getresponse()
content = response.read()
# the painful part
sr = codecs.getreader("cp1250")(content)
lines = sr.readlines() # d'oh!
But after the call to readlines I only get yells echoing from somewhere deep inside codecs:
[...snip...]
File "./download", line 123, in _parse
lines = sr.readlines()
File "/usr/lib/python2.7/codecs.py", line 588, in readlines
data = self.read()
File "/usr/lib/python2.7/codecs.py", line 471, in read
newdata = self.stream.read()
AttributeError: 'str' object has no attribute 'read'
My prints confirm that sr is an instance of StreamReader; it confuses me that the object seemed to initialize fine but now fails to execute readlines. What is missing here?
Or is the library trying to cryptically tell me that the data is corrupted (not CP-1250)?
Edit: As jorispilot suggests, unicode(content, encoding="cp1250") works, so I'll probably stick with that for my solution. However, I'd still like to know what was wrong with my usage of the codecs library.
Upvotes: 0
Views: 1777
Reputation: 3130
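According to http://docs.python.org/2/library/codecs.html, getreader() returns the StreamReader class for the given encoding. Its constructor must be passed a stream, that is, an object that implements read(), not a string as you are doing. The constructor only stores the object without checking it, which is why the error shows up at readlines() rather than at construction.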
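To tell the two failure modes apart, here is a small sketch (the byte values are made up for illustration, and it is written so that it should also run on Python 3). Passing a plain string fails with AttributeError on the first read, while data that really is not valid CP-1250 would be expected to fail with UnicodeDecodeError instead:

```python
import codecs
import io

make_reader = codecs.getreader("cp1250")

# Case 1: a plain byte string instead of a stream. Construction succeeds,
# the failure only appears once the reader calls stream.read().
reader = make_reader(b"plain bytes, not a stream")
try:
    reader.readlines()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Case 2: a real stream holding a byte (0x81) that CP-1250 does not define.
bad_reader = make_reader(io.BytesIO(b"\x81"))
try:
    bad_reader.read()
    bad_error_name = None
except UnicodeDecodeError as exc:
    bad_error_name = type(exc).__name__

print(error_name)
print(bad_error_name)
```

So the traceback in the question points at the argument type, not at corrupted data.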
To fix this, don't read the data from response yourself, but pass response directly to the StreamReader, as below.
import codecs
import httplib

conn = httplib.HTTPConnection('server')
conn.request('GET', '/')
response = conn.getresponse()
# response has a read() method, so it can serve as the stream
reader = codecs.getreader("cp1250")(response)
lines = reader.readlines()
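If the body has already been read into a string, as in the question, one option is to wrap it in an in-memory stream before handing it to the reader. A minimal sketch, assuming the sample bytes below stand in for the result of response.read() (they are invented for illustration); io.BytesIO comes from the standard library:

```python
import codecs
import io

# Stand-in for content = response.read(): two lines of CP-1250 bytes.
content = u"\u010de\u0161tina\nk\u016f\u0148\n".encode("cp1250")

# BytesIO provides the read() method that StreamReader expects.
reader = codecs.getreader("cp1250")(io.BytesIO(content))

# readlines() returns decoded (unicode) lines, line endings included.
lines = reader.readlines()
print(len(lines))
```

Note that the result is a list of decoded text lines, not byte strings.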
Upvotes: 1
Reputation: 4348
utf8_lines = []
for line in content.split('\n'):
    # decode the CP-1250 bytes to unicode, then re-encode as UTF-8
    line = line.strip().decode('cp1250')
    utf8_lines.append(line.encode('utf-8'))
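A variant of the same idea that decodes the whole payload once and writes the result straight to a UTF-8 file might look like the sketch below. The function name recode_cp1250_to_utf8, the sample bytes and the temporary output path are all hypothetical, chosen only for illustration:

```python
import io
import os
import tempfile


def recode_cp1250_to_utf8(raw, out_path):
    """Decode CP-1250 bytes, split into lines, and save them as UTF-8."""
    text = raw.decode("cp1250")
    # splitlines() copes with both \n and \r\n line endings
    lines = text.splitlines()
    # newline="\n" keeps the written line endings identical on every platform
    with io.open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for line in lines:
            out.write(line + u"\n")
    return lines


# Stand-in for content = response.read()
sample = u"\u010de\u0161tina\r\nk\u016f\u0148\r\n".encode("cp1250")
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
decoded = recode_cp1250_to_utf8(sample, out_path)

with open(out_path, "rb") as saved:
    saved_bytes = saved.read()
```

Unlike strip(), splitlines() leaves leading and trailing spaces within each line untouched, which may or may not be what you want.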
Upvotes: 2