Sanjuro
Sanjuro

Reputation: 87

Character encoding issue when using Python3/mysqlclient to retreive data encoded in latin1

I'm running into a character encoding issue when retrieving data from an older database that is using latin1 encoding. The problem is occurring when I try to retrieve characters from the database that fall in the \x80 to \x9f range, which is the range that is different between MySQL's latin1 (aka windows-1252 in Python) and the official latin1 (ISO-8859-1).

This is the stack that I'm using:

As an example, I'm trying to retrieve the word "Isn't" from the database where the apostrophe is encoded as \x92.

If I don't pass a charset to the mysqlclient connection through Django settings, I get the error "'utf-8' codec can't decode byte 0x92 in position 5: invalid start byte".

If I pass latin1 as the codec to the connection, there is no error but the word renders to the page as "Isn t", with blank space where the apostrophe should be.

When I open up a separate python shell session and try the connection from the python command line, the result is "Isn\x92t".

>>> import MySQLdb
>>> conex = MySQLdb.connect(host=<host>,db=<db>, user=<user>, passwd=<passwd>, charset="latin1")
>>> cursor = conex.cursor()
>>> cursor.execute("select <field> from <table> where id=<id>")
1L
>>> cursor.fetchall()
((u'Isn\x92t',),)

It doesn't seem to make any difference if I include the charset or not when making the call from the command line. So this connection string

>>> conex = MySQLdb.connect(host=<host>,db=<db>, user=<user>, passwd=<passwd>, charset="latin1")

and this connection string

>>> conex = MySQLdb.connect(host=<host>,db=<db>, user=<user>, passwd=<passwd>)

have the same result.

Is there a way to set options for the mysql connection string that will handle the windows-1252 codes properly? Any help would be appreciated.

========= Edit with additional info =========

Thanks for your response Rick James. The original text snippet went away, but I found another similar one that it's failing on: Women's.

Here's the HEX select:

mysql> SELECT title, HEX(title) from <table> where id = <id>
| title | HEX(title)
| Women?s | 576F6D656E9273

I'm not sure I feel comfortable putting the entire create table statement online but here are what I figured were the important bits of the SHOW CREATE TABLE. Let me know if you were looking for something else.

CREATE TABLE `tbl` (
  `title` varchar(255) DEFAULT NULL,
) ENGINE=MyISAM AUTO_INCREMENT=9460 DEFAULT CHARSET=latin1 

And lastly the SHOW VARIABLES LIKE 'char%'; results:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |

Modifying the charset in the Django database settings file to utf8mb4 resulted in the same unicode error as when it's set to utf8.

'OPTIONS': {
  'charset': 'utf8mb4',
  'use_unicode': True,
}

I'm still a bit confused about why the direct query using a standalone python environment with mysqlclient wouldn't work. That would at least take any of the Django issues out of the equation.

Upvotes: 4

Views: 1654

Answers (1)

Rick James
Rick James

Reputation: 142298

These charsets cp1250, cp1251, cp1256, cp1257, geostd8, latin1 treat x92 as .

The setup for Django should specify the character encoding you want to use in the client. This is independent of what is in the database. So the question becomes "Where are things misconfigured."

The u introducer is wrong since 92 is not valid UTF-8.

  • latin1 encoding at the column level -- fine
  • 0x92 in the data -- Verify by doing SELECT col, HEX(col) ...; Isn't should say 49736E9274.
  • Python source code should start with # -*- coding: utf-8 -*- if you want to use UTF-8 in the client.
  • Django needs this for utf8mb4 in the client:

    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            ...
            'OPTIONS': {
                        'charset': 'utf8mb4',
                        'use_unicode': True, },
        },
    }
    

For more help, please provide

SHOW CREATE TABLE ...;
SHOW VARIABLES LIKE 'char%';
the SELECT HEX mentioned above

Further notes (after the Question was extended considerably)

Woman?s -- see question mark in Trouble with UTF-8 characters; what I see is not what I stored -- It is probably derived from having the connection (char%) being latin1, but the client (DJango) talking utf8mb4. 3 of the char% settings declare what the client is using for encoding; the DJango setting contradicts that. Change one or the other. The table column does not need to match them; data will be converted as needed. That is, the 92 in a latin1 column is OK even with character_client/connection/results = utf8mb4.

For this Question, utf8 and utf8mb4 will act the same.

Upvotes: 1

Related Questions