Serg
Serg

Reputation: 99

Django: Unicode, MySQL and Encodings (latin1, koi8-r)

Django version 2.0. Python 3

My database charset and collation:

mysql> SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| latin1                   | latin1_swedish_ci    |
+--------------------------+----------------------+

Old developer inserted data in KOI8-R encoding using Perl :(

To get correct values from database I used ugly construction str(username).encode('latin1').decode('koi8-r'). And what? I need to use it in all my project to send data to output? Or write function to encode context dictionary, but i also need additional to encode/decode all data. It will affect the usability and productivity

Without this i get something like ëÏÚÌÑÎËÏ òÏÍÁÎ éÏÓÉÆÏ×ÉÞ

How to globally set encoding in Django to prevent encode/decode operation in every place? I changed encoding different ways and nothing happens.

In settings.py I tried to set DEFAULT_CHARSET into different encodings (if I set default_charset to KOI8-R i get an error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 6228-6235: character maps to . With other encodings no errors but no result). I tried to set in Database section of settings.py different values of charset and collation.

'OPTIONS': {
    'charset': 'latin1',
    'init_command': "SET sql_mode='STRICT_TRANS_TABLES', character_set_client=latin1, character_set_results=latin1, character_set_connection=latin1, collation_connection=latin1_swedish_ci",
}

I added <meta http-equiv="Content-type" content="text/html; charset=koi8-r (or other)" /> to <head> tag in index.html template. No result.

It seems that Django execute SET NAMES utf8 everytime

Why in Perl i can send header with charset=koi8-r and i get normal values from this tables in my browser using CGI? Why no similar result in Python with Django or Flask? Simple example in Perl

Upvotes: 3

Views: 1030

Answers (3)

Rick James
Rick James

Reputation: 142298

I'm pretty sure you need the 2-step ALTER:

You have CHARACTER SET latin1, but have non-latin1 bytes. You need to leave bytes alone while fixing charset:

First, lets assume you have this declaration for tbl.col:

col VARCHAR(111) CHARACTER SET latin1 NOT NULL

To convert the column without changing the bytes:

ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET koi8r NOT NULL;

Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)

-- http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

As you will see in that link, there are many different fixes for different scenarios. If you have already applied some of the other Answers, you may have made things worse! If you need further help, please provide a sample with

SELECT col, HEX(col) FROM ... WHERE ...

I may be able to deduce where things stand. ëÏÚÌÑÎËÏ òÏÍÁÎ éÏÓÉÆÏ×ÉÞ and þÅÐÒÑÇÏ×Á éÎÅÓÓÁ ÷ÉËÔÏÒÏ×ÎÁ look thoroughly mangled; a quick glance says that maybe a wrong conversion was applied. What was the source Korean text?

Upvotes: 0

Serg
Serg

Reputation: 99

I solved the problem.

The first way

ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;

and use_unicode=True, charset="utf8" attributes in connection. But I can not change active database on the server, only on the test PC.

The second way

In my virtual environment i have edited lib/python3.6/site-packages/MySQLdb/connections.py:

In Connection class I have added:

self.force_koi8r = kwargs2.pop('force_koi8r', kwargs.get('force_koi8r', False))

and below edited part of code:

def set_character_set(self, charset):
    #if charset == "utf8mb4":
    #    py_charset = "utf8"
    #else:
    #    py_charset = charset

    # bugfix:
    if charset == "utf8mb4":
        py_charset = "utf8"
    elif charset == "latin1" and self.force_koi8r == True:
        py_charset = 'koi8-r'
    else:
        py_charset = charset

Now it works fine with force_koi8r=True argument, but that is not a good solution. This solution is only for this project

Problem was in the title of encoding. Python knows this encoding as 'koi8-r' or 'koi8_r'. But MySQL knows it as 'koi8r'. And if I set charset=koi8-r - mysql gives an error, if I set charset=koi8r - Python gives an error. (unknown encoding)

The third way

If anybody knows how to add an alias of encoding in Python (default: koi8-r, koi8_r and i need to add koi8r) please tell me

Upvotes: 0

Alastair McCormack
Alastair McCormack

Reputation: 27704

I think you're confusing network character encoding with storage encoding. In MySQL the life of string data is roughly this:

disk_storage --decode--> MySQL --encode--> network --decode--> database_driver

When string data is read from the disk, MySQL decodes it using the character_set_database value. When a client connects over the network, the client specifies an encoding for the connection. For Python this is typically UTF-8. MySQL then encodes the data to the connection encoding. The Python Mysql driver then decodes the data it receives using the connection encoding it set.

If anyone of those decodings or encodings use the wrong value, then bad data will be created. If character_set_database has been set incorrectly, then MySQL will be decoding the data incorrectly before encoding the bad data on the network connection.

The solution should be as simple as changing character_set_database to the correct value without changing the actual data.

This can be achieved with:

ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;

(DO NOT RUN ALTER TABLE tbl_name CONVERT.. - this will actually re-encode your data. As the old character_set value was wrong, your data will be decoded incorrectly before encoding to the new encoding)

Change all Python settings back to their default (UTF-8 etc). Do not set DEFAULT_CHARSET or any other value.

To ensure that MySQL driver connects correctly and uses UTF-8 for the network connection set use_unicode=True and charset="utf8"

E.g.

>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")

Upvotes: 1

Related Questions