Reputation: 1301
I have a Python 2.6 script that is gagging on special characters, encoded in Latin-1, that I am retrieving from a SQL Server database. I would like to print these characters, but I'm somewhat limited because I am using a library that calls the unicode
factory, and I don't know how to make Python use a codec other than ascii
.
The script is a simple tool to return lookup data from a database without having to execute the SQL directly in a SQL editor. I use the PrettyTable 0.5 library to display the results.
The core of the script is this bit of code. The tuples I get from the cursor contain integer and string data, and no Unicode data. (I'd use adodbapi
instead of pyodbc
, which would get me Unicode, but adodbapi
gives me other problems.)
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
t.add_row(rec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t
But the Name
column can contain characters that fall outside the ASCII range. I'll sometimes get an error message like this, in line 222 of prettytable.pyc
, when it gets to the t.add_row
call:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 12: ordinal not in range(128)
This is line 222 in prettytable.py
. It uses unicode
, which is the source of my problems, and not just in this script, but in other Python scripts that I have written.
for i in range(0,len(row)):
if len(unicode(row[i])) > self.widths[i]: # This is line 222
self.widths[i] = len(unicode(row[i]))
Please tell me what I'm doing wrong here. How can I make unicode
work without hacking prettytable.py
or any of the other libraries that I use? Is there even a way to do this?
EDIT: The error occurs not at the print
statement, but at the t.add_row
call.
EDIT: With Bastien Léonard's help, I came up with the following solution. It's not a panacea, but it works.
x = pyodbc.connect(cxnstring)
r = x.cursor()
r.execute(sql)
t = PrettyTable(columns)
for rec in r:
urec = [s.decode('latin-1') if isinstance(s, str) else s for s in rec]
t.add_row(urec)
r.close()
x.close()
t.set_field_align("ID", 'r')
t.set_field_align("Name", 'l')
print t.get_string().encode('latin-1')
I ended up having to decode on the way in and encode on the way out. All of this makes me hopeful that everybody ports their libraries to Python 3.x sooner than later!
Upvotes: 7
Views: 50916
Reputation: 15855
After a quick peek at the source for PrettyTable, it appears that it works on unicode objects internally (see _stringify_row
, add_row
and add_column
, for example). Since it doesn't know what encoding your input strings are using, it uses the default encoding, usually ascii.
Now ascii is a subset of latin-1, which means if you're converting from ascii to latin-1, you shouldn't have any problems. The reverse however, isn't true; not all latin-1 characters map to ascii characters. To demonstrate this:
>>> s = u'\xed\x31\x32\x33'
>>> print s
# FAILS: Python calls "s.decode('ascii')", but ascii codec can't decode '\xed'
>>> print s.decode('ascii')
# FAILS: Same as above
>>> print s.decode('latin-1')
í123
Explicitly converting the strings to unicode (like you eventually did) fixes things, and makes more sense, IMO -- you're more likely to know what charset your data is using, than the author of PrettyTable :). BTW, you can omit the check for strings in your list comprehension by replacing .s.decode('latin-1')
with unicode(s, 'latin-1')
since all objects can be coerced to strings
One last thing: don't forget to check the character set of your database and tables -- you don't want to assume 'latin-1' in code, when the data is actually being stored as something else ('utf-8'?) in the database. In MySQL, you can use the SHOW CREATE TABLE <table_name>
command to find out what character set a table is using, and SHOW CREATE DATABASE <db_name>
to do the same for a database.
Upvotes: 0
Reputation: 61783
Add this at the beginning of the module:
# coding: latin1
Or decode the string to Unicode yourself.
[Edit]
It's been a while since I played with Unicode, but hopefully this example will show how to convert from Latin1 to Unicode:
>>> s = u'ééé'.encode('latin1') # a string you may get from the database
>>> s.decode('latin1')
u'\xe9\xe9\xe9'
[Edit]
Documentation:
http://docs.python.org/howto/unicode.html
http://docs.python.org/library/codecs.html
Upvotes: 7
Reputation: 42367
Maybe try to decode the latin1-encoded strings into unicode?
t.add_row((value.decode('latin1') for value in rec))
Upvotes: 2