Matt

Reputation: 5567

MySQL Character Encodings

I have a large database that was originally created with latin1 encoding and latin1_swedish_ci collations.

I've made a lot of changes and now I am using Doctrine2, which uses SET NAMES UTF8 before queries. I was previously using the Yii framework, which was also set up to use UTF-8. Basically, I overlooked all of this at the start, but from what I understand, I've been writing UTF-8 data to the database, although it's all supposed to be latin1.

I'd like to convert my database to utf-8 but have no idea how to do it safely and ensure I'm not losing data.

Two questions:

  1. Is there a way to ensure that I have done this correctly? I have like 2 GB of data so I can't just scan through it to test, but are there certain characters that I can look for to determine if the database is using UTF-8 already? (I've put a rough check I'm considering after this list.)

  2. What is the proper way to convert everything? I've seen some people say I have to mysqldump and re-import (with certain flags, http://blog.makezine.com/2007/05/08/mysql-database-migration-latin/ and http://docs.moodle.org/22/en/Converting_your_MySQL_database_to_UTF8). Others say you can ALTER TABLE... per column (http://www.bothernomore.com/2008/12/16/character-encoding-hell/). I think I've also seen something like SET utf8 as part of an ALTER TABLE, but I don't know if that works.
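
For question 1, here is the kind of rough check I've been considering, using one of my tables (articles) as an example. The idea is to pull out rows whose title contains anything outside plain ASCII and eyeball the hex to see whether it looks like latin1 (e.g. E9 for é) or UTF-8 (e.g. C3 A9 for é). I'm not sure how reliable this is:

-- show the raw stored bytes of any title that isn't plain printable ASCII
SELECT title, HEX(title) AS raw_hex
FROM articles
WHERE title RLIKE '[^ -~]'
LIMIT 50;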

EDIT:

I've dumped the data and found several accented characters. Would this indicate that the data itself is UTF-8? If so, I think the instructions here apply, since I can just "convert" to blob and back safely: http://codex.wordpress.org/Converting_Database_Character_Sets

EDIT AGAIN:

After doing what I read in the previous link and then comparing the data, I found that I was missing all characters after the first non-ASCII(?) one. For example, titles containing curly single quotes lost that character and everything after it in the new database. Here is the code I was running:

-- step 1: change the column to binary so the stored bytes are kept as-is
ALTER TABLE articles CHANGE title title VARBINARY(255) NOT NULL;
-- step 2: change it back to a character type, telling MySQL the bytes are utf8
ALTER TABLE articles CHANGE title title VARCHAR(255) CHARACTER SET utf8 NOT NULL;

Somehow that caused me to lose data.

But if I dump the data, change the charset for each table from latin1 to utf8 in the dump, and re-import, it works. I'd prefer to just modify things in place rather than dump and re-create, but I'll resort to that if no one has any other suggestions or ideas based on this.

Upvotes: 2

Views: 3700

Answers (2)

Matt

Reputation: 5567

I got it working with the dump and reimport. I followed this guide mostly: http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL

If anyone else is in my situation (you have UTF-8 data stored in a latin1 database which you've always accessed by calling SET NAMES utf8), here's what you can try (my modifications from the above source).

mysqldump -h example.org --user=foo -p -c --insert-ignore --skip-set-charset -r dump.sql dbname

Double check that it is UTF-8 (mine was)

file dump.sql

Do the conversion on the dump

perl -pi -w -e 's/CHARSET=latin1/CHARSET=utf8/g;' dump.sql

Create a new database (I didn't drop the old just in case)

mysql --user=foo -p --execute="CREATE DATABASE dbnameutf8 CHARACTER SET utf8 COLLATE utf8_general_ci;"

Import

mysql --user=foo -p --default-character-set=utf8 dbnameutf8 < dump.sql

Hope that can help someone else. Just remember, the ALTER TABLE... stuff may not work (it did not in my case).

Upvotes: 1

hkf

Reputation: 4520

To check things you can do something like:

SELECT t1.*
FROM table_1 t1
JOIN table_1 t2 ON t1.guid = t2.guid
  AND t1.field_1 <> CONVERT(t2.field_1 USING utf8) COLLATE utf8_general_ci

Basically, this changes the character set/collation of the column on the fly so you can see whether anything is going to go wrong.

You definitely don't need to re-import everything; ALTER TABLE to the new encoding should be fine, assuming everything converts OK.
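
For example, something along these lines (using the articles table from the question) converts a table and all of its text columns in one statement:

ALTER TABLE articles CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Just keep in mind that CONVERT TO CHARACTER SET re-encodes the values on the assumption that they really are correct latin1 right now, which is the "assuming everything will convert OK" part.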

Upvotes: 2
