Reputation: 87
I'm running into a character encoding issue when retrieving data from an older database that is using latin1
encoding. The problem is occurring when I try to retrieve characters from the database that fall in the \x80
to \x9f
range, which is the range that is different between MySQL's latin1
(aka windows-1252
in Python) and the official latin1
(ISO-8859-1).
This is the stack that I'm using:
latin1
encoding at the column level and latin1-swedish-ci
collation at the table level.As an example, I'm trying to retrieve the word "Isn't" from the database where the apostrophe is encoded as \x92
.
If I don't pass a charset to the mysqlclient connection through Django settings, I get the error "'utf-8' codec can't decode byte 0x92 in position 5: invalid start byte".
If I pass latin1 as the codec to the connection, there is no error but the word renders to the page as "Isn t", with blank space where the apostrophe should be.
When I open up a separate python shell session and try the connection from the python command line, the result is "Isn\x92t".
>>> import MySQLdb
>>> conex = MySQLdb.connect(host=<host>,db=<db>, user=<user>, passwd=<passwd>, charset="latin1")
>>> cursor = conex.cursor()
>>> cursor.execute("select <field> from <table> where id=<id>")
1L
>>> cursor.fetchall()
((u'Isn\x92t',),)
It doesn't seem to make any difference if I include the charset or not when making the call from the command line. So this connection string
>>> conex = MySQLdb.connect(host=<host>,db=<db>, user=<user>, passwd=<passwd>, charset="latin1")
and this connection string
>>> conex = MySQLdb.connect(host=<host>,db=<db>, user=<user>, passwd=<passwd>)
have the same result.
Is there a way to set options for the mysql connection string that will handle the windows-1252 codes properly? Any help would be appreciated.
========= Edit with additional info =========
Thanks for your response Rick James. The original text snippet went away, but I found another similar one that it's failing on: Women's.
Here's the HEX select:
mysql> SELECT title, HEX(title) from <table> where id = <id>
| title | HEX(title)
| Women?s | 576F6D656E9273
I'm not sure I feel comfortable putting the entire create table statement online but here are what I figured were the important bits of the SHOW CREATE TABLE
. Let me know if you were looking for something else.
CREATE TABLE `tbl` (
`title` varchar(255) DEFAULT NULL,
) ENGINE=MyISAM AUTO_INCREMENT=9460 DEFAULT CHARSET=latin1
And lastly the SHOW VARIABLES LIKE 'char%';
results:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
Modifying the charset in the Django database settings file to utf8mb4
resulted in the same unicode error as when it's set to utf8
.
'OPTIONS': {
'charset': 'utf8mb4',
'use_unicode': True,
}
I'm still a bit confused about why the direct query using a standalone python environment with mysqlclient
wouldn't work. That would at least take any of the Django issues out of the equation.
Upvotes: 4
Views: 1654
Reputation: 142298
These charsets cp1250, cp1251, cp1256, cp1257, geostd8, latin1
treat x92 as ’
.
The setup for Django should specify the character encoding you want to use in the client. This is independent of what is in the database. So the question becomes "Where are things misconfigured."
The u
introducer is wrong since 92 is not valid UTF-8.
SELECT col, HEX(col) ...
; Isn't
should say 49736E9274
.# -*- coding: utf-8 -*-
if you want to use UTF-8 in the client.Django needs this for utf8mb4 in the client:
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.mysql',
...
'OPTIONS': {
'charset': 'utf8mb4',
'use_unicode': True, },
},
}
For more help, please provide
SHOW CREATE TABLE ...;
SHOW VARIABLES LIKE 'char%';
the SELECT HEX mentioned above
Further notes (after the Question was extended considerably)
Woman?s
-- see question mark in Trouble with UTF-8 characters; what I see is not what I stored -- It is probably derived from having the connection (char%) being latin1, but the client (DJango) talking utf8mb4. 3 of the char% settings declare what the client is using for encoding; the DJango setting contradicts that. Change one or the other. The table column does not need to match them; data will be converted as needed. That is, the 92 in a latin1 column is OK even with character_client/connection/results = utf8mb4.
For this Question, utf8 and utf8mb4 will act the same.
Upvotes: 1