Multiple languages with utf8 in postgresql

Question

How exactly is one meant to seamlessly support all languages stored within postgres's utf8 character set? We seem to be required to specify a single language-specific collation along with the character set, such as en_US.utf8. If I'm not mistaken, we don't have the ability to store both English (en_US) and Chinese (zh_CN) in the same utf8 column, while maintaining any kind of meaningful collation behavior. If I define a column as en_US.utf8, how is it supposed to handle values containing Chinese (zh_CN) characters / byte sequences? The reality is that a single column value can contain multiple languages (ex: "Hello and 晚安"), and simply cannot be collated according to a single language.

Yes, I can physically store any character sequences; but what is the defined behavior for ordering on a en_US.utf8 column that contains English, German, Chinese, Japanese and Korean strings?

I understand that mysql's utf8mb4_unicode_ci collation isn't perfect, and that it is not following any set standard for how to collate the entire unicode set. I can already hear the anti-mysql crowd sighing about how mysql's language-agnostic collations are arbitrary, semantically meaningless, or even purely invalid. But the fact is, it works well enough, and fulfills the expectation that utf8 = multi-language unicode support.

Is postgres just being extremely stubborn with the fact that it's semantically incorrect to collate across the unicode spectrum? I know the developers are very strict when it comes to "doing things according to spec", but this inability to juggle multiple languages is frustrating to say the least. Am I missing something that solves the multi-language problem, or is the official stance that a single utf8 column can handle any language, but only one language at a time?

Laurenz Albe · Accepted Answer

You are right that there will never be a perfect way to collate strings across languages.

PostgreSQL has decided not to create its own collations but to use those provided by the operating system. The idea behind this is to avoid re-inventing the wheel and to reduce maintenance effort.
So the traditional PostgreSQL answer to your question would be: if you want a string collation that works reasonably well for strings in different languages, complain to your operating system vendor or pick an operating system that provides such a collation.

However, this approach has drawbacks that the PostgreSQL community is aware of:

Few – if any – people decide on an operating system based on the collation support it provides.
PostgreSQL's sorting behaviour depends on the underlying operating system, which leads to frequent questions by confused users on the mailing lists.
With some operating systems collation behaviour can change during an operating system upgrade, leading to corrupt database indexes (see for example this thread).

It may well be that PostgreSQL changes its approach; there have been repeated efforts to use ICU libraries instead of operating system collations (see for example this recent thread), which would mitigate some of these problems.

Multiple languages with utf8 in postgresql

Answers (1)

Related Questions