Reputation: 99
Django version 2.0. Python 3
My database charset and collation:
mysql> SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| latin1 | latin1_swedish_ci |
+--------------------------+----------------------+
Old developer inserted data in KOI8-R encoding using Perl :(
To get correct values from database I used ugly construction str(username).encode('latin1').decode('koi8-r')
. And what? I need to use it in all my project to send data to output? Or write function to encode context dictionary, but i also need additional to encode/decode all data. It will affect the usability and productivity
Without this i get something like ëÏÚÌÑÎËÏ òÏÍÁÎ éÏÓÉÆÏ×ÉÞ
How to globally set encoding in Django to prevent encode/decode operation in every place? I changed encoding different ways and nothing happens.
In settings.py I tried to set DEFAULT_CHARSET into different encodings (if I set default_charset to KOI8-R i get an error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 6228-6235: character maps to . With other encodings no errors but no result). I tried to set in Database section of settings.py different values of charset and collation.
'OPTIONS': {
'charset': 'latin1',
'init_command': "SET sql_mode='STRICT_TRANS_TABLES', character_set_client=latin1, character_set_results=latin1, character_set_connection=latin1, collation_connection=latin1_swedish_ci",
}
I added <meta http-equiv="Content-type" content="text/html; charset=koi8-r (or other)" />
to <head>
tag in index.html template. No result.
It seems that Django execute SET NAMES utf8
everytime
Why in Perl i can send header with charset=koi8-r and i get normal values from this tables in my browser using CGI? Why no similar result in Python with Django or Flask? Simple example in Perl
Upvotes: 3
Views: 1030
Reputation: 142298
I'm pretty sure you need the 2-step ALTER
:
You have CHARACTER SET latin1
, but have non-latin1 bytes. You need to leave bytes alone while fixing charset:
First, lets assume you have this declaration for tbl.col:
col VARCHAR(111) CHARACTER SET latin1 NOT NULL
To convert the column without changing the bytes:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET koi8r NOT NULL;
Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)
-- http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
As you will see in that link, there are many different fixes for different scenarios. If you have already applied some of the other Answers, you may have made things worse! If you need further help, please provide a sample with
SELECT col, HEX(col) FROM ... WHERE ...
I may be able to deduce where things stand. ëÏÚÌÑÎËÏ òÏÍÁÎ éÏÓÉÆÏ×ÉÞ
and þÅÐÒÑÇÏ×Á éÎÅÓÓÁ ÷ÉËÔÏÒÏ×ÎÁ
look thoroughly mangled; a quick glance says that maybe a wrong conversion was applied. What was the source Korean text?
Upvotes: 0
Reputation: 99
I solved the problem.
The first way
ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;
and use_unicode=True, charset="utf8"
attributes in connection.
But I can not change active database on the server, only on the test PC.
The second way
In my virtual environment i have edited lib/python3.6/site-packages/MySQLdb/connections.py:
In Connection
class I have added:
self.force_koi8r = kwargs2.pop('force_koi8r', kwargs.get('force_koi8r', False))
and below edited part of code:
def set_character_set(self, charset):
#if charset == "utf8mb4":
# py_charset = "utf8"
#else:
# py_charset = charset
# bugfix:
if charset == "utf8mb4":
py_charset = "utf8"
elif charset == "latin1" and self.force_koi8r == True:
py_charset = 'koi8-r'
else:
py_charset = charset
Now it works fine with force_koi8r=True
argument, but that is not a good solution. This solution is only for this project
Problem was in the title of encoding. Python knows this encoding as 'koi8-r' or 'koi8_r'. But MySQL knows it as 'koi8r'. And if I set charset=koi8-r - mysql gives an error, if I set charset=koi8r - Python gives an error. (unknown encoding)
The third way
If anybody knows how to add an alias of encoding in Python (default: koi8-r, koi8_r and i need to add koi8r) please tell me
Upvotes: 0
Reputation: 27704
I think you're confusing network character encoding with storage encoding. In MySQL the life of string data is roughly this:
disk_storage --decode--> MySQL --encode--> network --decode--> database_driver
When string data is read from the disk, MySQL decodes it using the character_set_database
value.
When a client connects over the network, the client specifies an encoding for the connection. For Python this is typically UTF-8. MySQL then encodes the data to the connection encoding.
The Python Mysql driver then decodes the data it receives using the connection encoding it set.
If anyone of those decodings or encodings use the wrong value, then bad data will be created. If character_set_database
has been set incorrectly, then MySQL will be decoding the data incorrectly before encoding the bad data on the network connection.
The solution should be as simple as changing character_set_database
to the correct value without changing the actual data.
This can be achieved with:
ALTER DATABASE dbname CHARACTER SET koi8r COLLATE koi8r_general_ci;
(DO NOT RUN ALTER TABLE tbl_name CONVERT..
- this will actually re-encode your data. As the old character_set value was wrong, your data will be decoded incorrectly before encoding to the new encoding)
Change all Python settings back to their default (UTF-8 etc). Do not set DEFAULT_CHARSET
or any other value.
To ensure that MySQL driver connects correctly and uses UTF-8 for the network connection set use_unicode=True
and charset="utf8"
E.g.
>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")
Upvotes: 1