Reputation: 2232
I am new to multilingual data, and I confess that I have never worked with it before. I am currently working on a multilingual site, but I do not know in advance which languages will be used.
Which collation/character set of MySQL should I use to achieve this?
Should I use some Unicode type of character set?
And of course these languages are nothing exotic; they will be among the ones in common use.
Upvotes: 14
Views: 15529
Reputation: 99
For MySQL 8 and above, the following character set and collation will work for multilingual data: charset = utf8mb4, collation = utf8mb4_unicode_520_ci.
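One reason utf8mb4 is usually suggested over MySQL's older utf8 character set (an alias of utf8mb3) is that the older one stores at most three bytes per character, while some characters need four bytes in UTF-8. The following is a minimal Python sketch of that check; the helper names max_utf8_bytes and fits_in_three_bytes are made up for illustration and are not part of MySQL or any library.

```python
def max_utf8_bytes(text: str) -> int:
    """Return the largest number of UTF-8 bytes any single character in text needs."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_in_three_bytes(text: str) -> bool:
    """True if every character encodes to at most three UTF-8 bytes."""
    return max_utf8_bytes(text) <= 3


# Sample strings chosen for illustration only.
samples = ["hello", "héllo", "こんにちは", "مرحبا", "hi 😀"]
for sample in samples:
    print(sample, max_utf8_bytes(sample), fits_in_three_bytes(sample))
```

Text that fails the three-byte check (the emoji sample here) is the kind of data that calls for utf8mb4.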
Upvotes: 0
Reputation: 654
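You can insert text in most languages into a MySQL table by changing the collation of the table field to 'utf8_general_ci'. It is case-insensitive. Note that MySQL's utf8 character set stores at most three bytes per character, so four-byte characters such as emoji need utf8mb4 instead.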
Upvotes: 0
Reputation: 1853
You should use a Unicode collation. You can set it as the default on your system, or on each field of your tables. The Unicode collations and their differences are as follows:
utf8_general_ci is a very simple collation. It removes all accents, converts to upper case, and then compares using the code of the resulting "base letter".
utf8_unicode_ci uses the default Unicode collation element table.
The main differences are:
utf8_general_ci does not support expansions or ligatures; it sorts all such letters as single characters, and sometimes in the wrong order.
The disadvantage of utf8_unicode_ci is that it is slightly slower than utf8_general_ci.
So unless you know exactly which languages and characters you are going to use, I recommend utf8_unicode_ci, which has broader coverage.
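To make the expansion point concrete, here is a rough Python analogy. It is not MySQL's actual implementation: general_like_key and expansion_aware_key are hypothetical helpers that only mimic the two ideas (one character in, one character out, versus allowing one character to expand into several).

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def general_like_key(text: str) -> str:
    """Per-character key: strip accents, upper-case, never expand a character."""
    return "".join(
        ch.upper() if len(ch.upper()) == 1 else ch  # refuse 1-to-many mappings
        for ch in strip_accents(text)
    )


def expansion_aware_key(text: str) -> str:
    """Key that allows expansions: casefold() turns the German sharp s into 'ss'."""
    return strip_accents(text.casefold()).upper()


# Accents and case are ignored by both keys.
print(general_like_key("café") == general_like_key("CAFE"))
# Only the expansion-aware key treats "Straße" and "STRASSE" as equal.
print(general_like_key("Straße") == general_like_key("STRASSE"))
print(expansion_aware_key("Straße") == expansion_aware_key("STRASSE"))
```

The real collations work from collation weight tables rather than string rewriting, so treat this only as an illustration of why a simpler comparison can disagree with a fuller one on some words.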
Extracted from MySQL forums.
Upvotes: 22
Reputation: 13620
UTF-8 encompasses most languages, so that's your safest bet. However, there are exceptions, and you need to make sure all languages you want to cover work in UTF-8. My experience with storing character sets MySQL doesn't understand is that it will not be able to sort properly, but the data has remained intact as long as I read it out in the same character encoding I wrote it in.
UTF-8 is the character encoding, a way of storing a number. Which character is represented by which number is defined by Unicode, an important distinction. Unicode covers a large number of languages and UTF-8 can encode them all (code points 0 to 10FFFF, sort of), but Java needs extra care, since its internal char is a 16-bit value and any code point above FFFF takes two chars, a surrogate pair (not that you care about Java :).
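The distinction between the Unicode number and its encoding can be seen in a short Python sketch. The describe helper is invented here for illustration; it reports the code point, the number of bytes UTF-8 uses, and the number of 16-bit units UTF-16 uses for a single character.

```python
def describe(ch: str) -> tuple:
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) for one character."""
    code_point = ord(ch)                            # the Unicode number
    utf8_bytes = len(ch.encode("utf-8"))            # how UTF-8 stores that number
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit units, as in a Java char
    return code_point, utf8_bytes, utf16_units


# Sample characters chosen for illustration only.
for ch in ["A", "é", "€", "😀"]:
    code_point, utf8_bytes, utf16_units = describe(ch)
    print(f"U+{code_point:04X} utf8_bytes={utf8_bytes} utf16_units={utf16_units}")
```

The same character keeps one code point regardless of encoding; only the stored byte sequence changes. The highest valid code point is 10FFFF, and Python's chr() rejects anything above it.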
Upvotes: 1