sifr_dot_in

Reputation: 3623

Which collation should I use for all world languages: utf8_general_ci, utf8_unicode_ci, or something else?

We are developing an Android app. The app accepts text from users and uploads it to a server (MySQL). This text is then read by other users.

While testing, I found that Hindi text gets inserted into the column as '????? '. After searching SO, I changed the collation to utf8_general_ci.

I am new to collations. I want to let user input text in any language in the world and others get the access. What should I do? Accuracy is a must.

But I saw a comment that says: "You should never, ever use utf8_general_ci. It simply doesn’t work. It’s a throwback to the bad old days of ASCII stooopeeedity from fifty years ago. Unicode case-insensitive matching cannot be done without the foldcase map from the UCD. For example, “Σίσυφος” has three different sigmas in it; or how the lowercase of “TSCHüẞ” is “tschüβ”, but the uppercase of “tschüβ” is “TSCHÜSS”. You can be right, or you can be fast. Therefore you must use utf8_unicode_ci, because if you don’t care about correctness, then it’s trivial to make it infinitely fast."

Upvotes: 3

Views: 4609

Answers (1)

nj_

Reputation: 2339

Your question title is asking about collations, but in the body you say:

I want to let user input text in any language in the world and others get the access.

So I'm assuming that is what you're specifically after. To clarify: a collation determines how MySQL compares and sorts strings, but it is not what makes it possible to store Unicode characters.
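To make the "comparison, not storage" point concrete, here is a rough analogy in Python rather than MySQL itself. It uses Python's built-in `str.casefold()` as a stand-in for what a case-insensitive collation does; the helper name `same_ignoring_case` is made up for this sketch, and MySQL's collations apply their own rules, which may differ in detail.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding.

    This only mimics the idea behind a case-insensitive collation;
    it is not how MySQL implements any particular collation.
    """
    return a.casefold() == b.casefold()


# The strings are stored exactly as given; only the comparison rule changes.
print("straße" == "STRASSE")                    # plain comparison
print(same_ignoring_case("straße", "STRASSE"))  # folded comparison
print(same_ignoring_case("σ", "ς"))             # the two lowercase sigmas
```

The stored text is identical in both cases. Only the answer to "are these two strings equal?" depends on the comparison rule, which is the role a collation plays.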

For storage, you need to ensure that the character set is defined correctly. MySQL allows you to specify character set and collation values at the column level, but it also allows you to specify defaults at the table and database level. In general I'd advise setting defaults at the database and table level, and letting MySQL handle the rest when defining columns. Note that if columns already exist with a different character set, you'll need to look into converting them. Depending on what you're using to communicate with MySQL, you may need to specify a character encoding for the connection too.
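As a minimal sketch of those three places (database default, existing table, connection), the helper below just builds the MySQL statements as strings. The names `app_db` and `messages` are hypothetical, and the default collation `utf8mb4_unicode_ci` is only a placeholder assumption, not a recommendation. Run the statements with whatever client or driver you already use; many drivers also offer a connection charset option that makes the `SET NAMES` step unnecessary.

```python
from typing import List


def charset_statements(database: str, table: str,
                       charset: str = "utf8mb4",
                       collation: str = "utf8mb4_unicode_ci") -> List[str]:
    """Build MySQL statements that set the character set in three places.

    The database and table names are supplied by the caller; the
    defaults for charset and collation are assumptions for this sketch.
    """
    return [
        # Default for tables created in this database from now on.
        f"ALTER DATABASE `{database}` CHARACTER SET {charset} COLLATE {collation};",
        # Convert an existing table and its text columns.
        f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};",
        # Tell the server which encoding this connection sends and expects.
        f"SET NAMES {charset};",
    ]


for statement in charset_statements("app_db", "messages"):
    print(statement)
```

Take a backup before converting an existing table, since the conversion rewrites the stored data.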

Note that utf8mb4 is an absolute must for the character set; do not use plain utf8. MySQL's utf8 stores at most 3 bytes per character, so you won't be able to store Unicode characters that take 4 bytes in UTF-8, such as emoji.
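If you want to see the 3-byte versus 4-byte difference for yourself, here is a small Python sketch. The function names `utf8_length` and `needs_utf8mb4` are invented for this example; the check relies on the fact that code points above U+FFFF take 4 bytes in UTF-8.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """True if the text contains a code point above U+FFFF,
    i.e. one that needs 4 bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)


hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"  # the Hindi word "namaste"
emoji = "\U0001F600"                            # a grinning face emoji

print(utf8_length(hindi[0]))         # bytes for one Devanagari letter
print(utf8_length(emoji))            # bytes for the emoji
print(needs_utf8mb4(hindi))          # Hindi alone
print(needs_utf8mb4(hindi + emoji))  # Hindi plus emoji
```

Hindi text on its own fits within 3 bytes per character, so the '?????' you saw most likely came from a non-Unicode character set somewhere along the way. The emoji case is where plain utf8 falls short.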

As for the collation to use, I don't really have a recommendation, as it depends on whether you're aiming for speed or accuracy. Other answers cover that topic in a fair amount of detail.

Upvotes: 3
