maestro416

Reputation: 924

MySQL collation type for multilanguage support

I'm creating a website that will store tutorial videos in several different languages. English speakers will be the primary audience, but I expect French accents to be used in usernames and passwords, along with Swedish/Norwegian accents and characters.

The tutorial videos will also be offered in Chinese (both Cantonese and Mandarin), Urdu/Hindi, Farsi/Dari, and Arabic. While I'm pretty sure speakers of the last few use standard QWERTY keyboards online, especially to register, I do know that European keyboards vary and have several accents and ligatures.

As far as MySQL is concerned, which collation type would be best suited for storing usernames and email addresses so that it supports the most probable entries? I know I probably cannot cover them all, but I'd like to cover as many as possible.

I've read that utf8_general_ci is better, but how would it differ from latin1_swedish_ci if I'm looking to support those Scandinavian characters?

EDIT: the user_id field and email fields will be unique - so [email protected] would not be the same as fré[email protected]

Upvotes: 2

Views: 2125

Answers (2)

deceze

Reputation: 521995

The collation is pretty much irrelevant for storing data; it only specifies the rules for comparison and sorting. What you need is the right charset, which should be utf8. If your MySQL version is >= 5.5, you should even use utf8mb4 or utf16, both of which cover the entirety of Unicode (MySQL's utf8 is a limited subset of real UTF-8, covering only the BMP). A latin1 charset limits you to the 256 characters defined in it.
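To make the charset limits concrete, here is a small Python sketch. It does not talk to MySQL at all; Python's `latin-1` codec (ISO-8859-1) is only an approximation of MySQL's latin1, and the helper names `fits_latin1`, `fits_bmp` and `utf8_len` are made up for this illustration. It checks which sample characters a single-byte charset can hold, and which ones fall outside the BMP and therefore need 4 bytes in UTF-8 (utf8mb4 rather than MySQL's 3-byte utf8).

```python
# Illustration only: Python codecs stand in for MySQL charsets here.

def fits_latin1(ch: str) -> bool:
    """True if the character can be encoded in a single-byte Latin-1 charset."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

def fits_bmp(ch: str) -> bool:
    """True if the code point lies in the Basic Multilingual Plane (at most 3 UTF-8 bytes)."""
    return ord(ch) <= 0xFFFF

def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in real UTF-8."""
    return len(ch.encode("utf-8"))

# French, Swedish, Chinese, Arabic, and an emoji outside the BMP.
samples = ["é", "å", "中", "ع", "\U0001F600"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        "latin1 ok" if fits_latin1(ch) else "latin1 FAILS",
        "3-byte utf8 ok" if fits_bmp(ch) else "needs utf8mb4",
        f"({utf8_len(ch)} bytes in UTF-8)",
    )
```

The accented European letters fit in latin1, the Chinese and Arabic characters do not, and only the supplementary-plane character requires the 4-byte encoding.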

If you want to prevent similar entries from being treated as the same value (for example, in your unique columns), use the appropriate _bin collation.
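The difference between a _bin collation and a _ci collation can be sketched in Python. This is a rough stand-in and not MySQL's actual comparison algorithm: `equal_bin` compares the encoded bytes, while `equal_ci` strips accents and case before comparing. Both function names and the sample usernames are hypothetical.

```python
import unicodedata

def fold(s: str) -> str:
    """Rough accent- and case-insensitive form: decompose, drop combining marks, casefold."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def equal_ci(a: str, b: str) -> bool:
    """Approximates an accent- and case-insensitive collation."""
    return fold(a) == fold(b)

def equal_bin(a: str, b: str) -> bool:
    """Approximates a _bin collation: strings match only if their bytes match."""
    return a.encode("utf-8") == b.encode("utf-8")

# Hypothetical usernames that differ only by accents and case.
a, b = "frederic", "Fr\u00e9d\u00e9ric"

print(equal_ci(a, b))   # True: a unique index would reject the second one
print(equal_bin(a, b))  # False: both could coexist in a unique column
```

With an insensitive comparison, the two names collide in a unique column; with a binary comparison, they are stored as distinct values.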

Upvotes: 1

Björn

Reputation: 29381

I wouldn't use utf8_general_ci; I'd use utf8_unicode_ci instead. It has much better support for sorting and comparisons, and you can switch from utf8_unicode_ci to one of several language-specific collations, for example utf8_swedish_ci, to get the correct Swedish sorting and comparison.

The con is that it's somewhat slower than utf8_general_ci, but IMO you gain so much more.
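As a rough illustration of why a language-specific collation can matter: in Swedish, å, ä and ö are separate letters that sort after z, whereas a generic accent-insensitive ordering files them next to a and o. The Python sketch below mimics both orderings with hand-written sort keys. It is not how MySQL implements its collations, and `swedish_key`, `generic_key` and the sample names are assumptions made up for this example.

```python
import unicodedata

# Swedish alphabet order: å, ä, ö come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def swedish_key(word: str) -> list:
    """Sort key following Swedish alphabet order; unknown characters sort last."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return [
        SWEDISH_ALPHABET.index(c) if c in SWEDISH_ALPHABET else len(SWEDISH_ALPHABET)
        for c in normalized
    ]

def generic_key(word: str) -> str:
    """Sort key that ignores accents, roughly like a generic accent-insensitive collation."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

names = ["örjan", "zlatan", "åsa", "anna", "olof"]

print(sorted(names, key=generic_key))  # åsa and örjan land among the a's and o's
print(sorted(names, key=swedish_key))  # åsa and örjan land after zlatan
```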

Upvotes: 0
