fire
fire

Reputation: 21531

UTF8 encoded text being altered incorrectly when changing collation

We have a MySQL table with fields using latin1_swedish_ci however there is some UTF-8 encoded data in there.

When I change the collation from latin1_swedish_ci to utf8_general_ci it converts the ISO-8859-1 stuff fine, but the UTF-8 data gets broken, seeing things like... you’re.

Any ideas?

Edit: we are using InnoDB with MySQL 5.1

Upvotes: 0

Views: 114

Answers (1)

Joni
Joni

Reputation: 111259

So, your "latin1" column had some data encoded in UTF-8, but not everything, and now you have some data doubly-encoded in UTF-8?

You could do this to fix the doubly encoded values: first convert back to latin1, re-interprete the string as binary, and tell MySQL that the binary string is actually text encoded in UTF-8. In MySQL this can be written as:

convert(binary convert(mycolumn using latin1) using utf8)

Now the problem is detecting which strings are doubly encoded, so that you can update only those. This can be done by comparing the length of the original string in characters with the length of the new string in bytes; for doubly-encoded text they should be equal. This is how you end up with:

update mytable set mycolumn = @str where char_length(mycolumn) = 
    length(@str := convert(binary convert(mycolumn using latin1) using utf8));

Upvotes: 1

Related Questions