Reputation: 6461
I am connecting to an MS SQL server through SQLAlchemy, using the pyodbc module. Everything appeared to be working fine until I began having problems with encodings: some of the non-ASCII characters are being replaced with '?'.
The DB has the collation 'Latin1_General_CI_AS' (I've also checked the specific fields, and they keep the same collation). I started selecting the encoding 'latin1' in the call to create_engine, and that appears to work for Western European characters (French or Spanish ones, like é), but not for Eastern European characters. Specifically, I have a problem with the character ć.
I have been trying to select other encodings listed in the Python documentation, specifically the Microsoft ones like cp1250 and cp1252, but I keep facing the same problem.
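For reference, a quick interpreter check (ć is U+0107) makes the difference between these codepages visible; a minimal sketch:

u'\u0107'.encode('cp1250')   # '\xe6': cp1250 does contain ć
u'\u0107'.encode('latin1')   # raises UnicodeEncodeError: latin1 has no ć
u'\u0107'.encode('cp1252')   # raises UnicodeEncodeError: neither does cp1252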
Does anyone know how to solve these differences? Does the collation 'Latin1_General_CI_AS' have an equivalent among Python encodings?
The code for my current connection is the following:

import pyodbc
from sqlalchemy import create_engine

def connect():
    return pyodbc.connect('DSN=database;UID=uid;PWD=password')

engine = create_engine('mssql://', creator=connect, encoding='latin1')
connection = engine.connect()
Clarifications and comments: How can a character like ć be stored then? Maybe I'm not correctly understanding collations. I have tried latin1, and also cp1250 and cp1252 (which apparently is the one used by 'Latin1_General_CI_AS', according to msdn).

UPDATE:
OK, following these steps (http://bytes.com/topic/sql-server/answers/142972-characters-encoding), I get that the encoding used by the DB appears to be cp1252. Anyway, that appears to be a bad assumption, as reflected in the answers.
UPDATE 2: In the end, after configuring the ODBC driver properly, I don't need to specify the encoding in the Python code at all.
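For completeness, the resulting call is simply the one above without the encoding argument (same connect() function):

engine = create_engine('mssql://', creator=connect)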
Upvotes: 5
Views: 13915
Reputation: 28666
Try connecting to the db with the pyodbc.connect() parameter convert_unicode=True, e.g. from sqlalchemy:
engine = create_engine('mssql://yourdb', connect_args={'convert_unicode': True})
This should make sure that all the results you get (and not only those from nvarchar columns etc.) are unicode, correctly converted from whatever encoding is used in the db.
As for writing to the db, just always use unicode. If I'm not mistaken (I will check later), pyodbc will make sure it gets written to the db correctly as well.
(of course, if the db uses an encoding that does not support the characters you want to write, you will still get errors: if you want the columns to support any kind of character, you will have to use unicode columns on the db too)
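A minimal sketch of the writing side, assuming a made-up users table with a name column:

from sqlalchemy import text

name = u'\u0107evap\u010di\u0107i'  # a unicode object containing ć and č
connection.execute(text('INSERT INTO users (name) VALUES (:name)'), name=name)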
Upvotes: 1
Reputation: 7033
OK, per http://msdn.microsoft.com/en-us/library/ms174596(v=SQL.90).aspx the encoding of Latin1_General_CI_AS is most probably cp1252. So you'd have to use encoding='cp1252'. But this solves only half of the problem, because you have to output the values somehow to see whether the characters are there or not. So if you have some_db_value, which you extracted from the database, you have to call some_db_value.encode('proper-output-encoding') to get it right. The proper-output-encoding depends on how you output it: on the console, it is the console encoding, which can be anything like 'cp1252', 'cp437' or 'cp850' (on Windows). On the web, it is the encoding of the webserver, hopefully 'utf-8'.
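A sketch of that round trip under these assumptions (cp1252 as the database encoding, cp850 as the console encoding; é is used instead of ć, which cp1252 cannot represent at all):

raw = '\xe9'                           # byte string from the driver: é in cp1252
some_db_value = raw.decode('cp1252')   # now a unicode object, u'\xe9'
print some_db_value.encode('cp850')    # re-encoded for a cp850 console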
Edit: please read John Machin's answer, as it is not clear whether 'cp1252' is the correct database encoding.
Upvotes: 0
Reputation: 82934
Original comment turned into an answer:
cp1250 and cp1252 are NOT "latin1 encodings". A collation is not an encoding. Re your comment: Who says that "the server is encoded in latin1"? If the server expects all input/output to be encoded in latin1 (which I doubt), then you quite simply can't get some Eastern European characters into your database (nor Russian, Chinese, Greek, etc etc).
Update:
You need to look further afield than the collation. """msdn.microsoft.com/en-us/library/ms174596(v=SQL.90).aspx suggests, for Latin1_General_CI_AS the used encoding is cp1252""" is codswallop. The table provides an LCID (locale ID), default collation, and codepage for each locale. Yes, the collation "Latin1_General_CI_AS" is listed in association with the cp1252 codepage for several locales. For two locales (Armenian and Georgian), it is listed in association with the "Unicode" codepage (!!!).
Quite simply, you need to find out what codepage the database is using.
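One way to check is the documented T-SQL function COLLATIONPROPERTY; running it through the question's connection is just a sketch:

result = connection.execute(
    "SELECT COLLATIONPROPERTY('Latin1_General_CI_AS', 'CodePage')")
print result.scalar()   # 1252 for this collation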
Try to extract data from the database without specifying an encoding at all. Don't bother encoding that in whatever encoding you guess your console may be using -- that only adds another source of confusion. Instead, use print repr(data). Report back here what you get from the repr() where you expect non-Latin1 characters.
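For example (table and column names are placeholders):

row = connection.execute('SELECT name FROM some_table WHERE id = 1').fetchone()
print repr(row[0])
# '\xe6...' would be cp1250 bytes for ć;
# u'\u0107...' would mean the driver already returns unicode.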
Upvotes: 2
Reputation: 170410
You should stop using code pages and switch to Unicode. That is the only way to get rid of this kind of problem.
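In SQLAlchemy terms, that means declaring the columns with the Unicode type, which maps to NVARCHAR on MS SQL Server; a sketch with a made-up table:

from sqlalchemy import Table, Column, Integer, Unicode, MetaData

metadata = MetaData()
users = Table('users', metadata,
              Column('id', Integer, primary_key=True),
              Column('name', Unicode(100)))   # emitted as NVARCHAR(100) on MSSQL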
Upvotes: 2