aug

Reputation: 11714

Intl.Collator sorting Japanese - Why does collator not prioritize Japanese characters first?

A friend and I were digging into sorting, and we wanted to follow best practice by using the compare function of Intl.Collator to sort according to the locale.

For other locales this works as expected: text in the locale's own script is prioritized over other scripts. Japanese seems to be different.

function letterSort(lang, letters) {
  letters.sort(new Intl.Collator(lang).compare);
  return letters;
}

console.log('EN sort:');
console.log(letterSort('en', ['a', '手に', '大人', 'b', '学校', '#', '金魚', 'きんぎょ', 'キンギョ']));

// =>["#", "a", "b", "きんぎょ", "キンギョ", "大人", "学校", "手に", "金魚"]

console.log('ZH sort:');
console.log(letterSort('zh', ['a', '手に', '大人', 'b', '学校', '#', '金魚', 'きんぎょ', 'キンギョ']));

// => ["#", "大人", "金魚", "手に", "学校", "a", "b", "きんぎょ", "キンギョ"]

console.log('JA sort:');
console.log(letterSort('ja', ['a', '手に', '大人', 'b', '学校', '#', '金魚', 'きんぎょ', 'キンギョ']));

// => ["#", "a", "b", "きんぎょ", "キンギョ", "大人", "学校", "手に", "金魚"]

In the snippet above, English and Chinese each prioritize their own text. Japanese, however, does not: its result is identical to the English one, with Latin letters ahead of kana and kanji.

After some digging, I found that there is an ICU Project Demo which produces a similar ordering, so the behavior seems to be enforced by ICU. Ordering Japanese text also appears to be a hard problem in general.


My coworker posted this as the takeaway, and I feel the article touches on it a little:

ok I think I understand the problem better, basically Japanese has four valid character sets one of them being roman characters so sorting in Japanese will sort each character set within itself and not amongst each other. And roman characters come first of the four sets (cause unicode).

Is that explanation correct? Or is there a better, more appropriate way of ordering Japanese so that Japanese characters are prioritized first? That sounds like bad practice, but I'm surprised Japanese speakers are fine with having their own language at the end of a sort. The article discusses the problem in detail as well, but I'm not sure whether newer approaches to ordering Japanese exist.

Upvotes: 2

Views: 721

Answers (1)

IliasT

Reputation: 4301

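Japanese text mixes four character sets. The collator groups characters by script, orders those groups relative to one another in a predetermined way, and does the finer-grained sorting inside each group:

  1. Rōmaji
  2. Hiragana and katakana
  3. Kanji

Hiragana and katakana do not form two separate groups: kana are ordered together by sound, which is why きんぎょ and キンギョ end up next to each other in the question's output.

Note: rōmaji is just the Roman (Latin) character set.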
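As a quick check of how the two kana scripts relate to each other, the sketch below sorts a mix of hiragana and katakana on their own. It assumes a runtime with ICU collation data (current browsers, default Node.js builds), and the `kana` sample and variable names are only illustrative. The characters come out in one sound-based (gojūon) sequence rather than as two separate blocks:

```javascript
// Illustrative sample: three hiragana and three katakana, all with different sounds.
const kana = ['ぬ', 'ち', 'よ', 'シ', 'イ', 'ホ'];

const ja = new Intl.Collator('ja');
const sortedKana = [...kana].sort(ja.compare);
console.log(sortedKana);
// => ["イ", "シ", "ち", "ぬ", "ホ", "よ"]  (i, shi, chi, nu, ho, yo)

// With sensitivity "base" only base letters are compared, so a word written in
// hiragana and the same word written in katakana are treated as equal.
const base = new Intl.Collator('ja', { sensitivity: 'base' });
const sameAtBase = base.compare('きんぎょ', 'キンギョ') === 0;
console.log(sameAtBase);
// => true
```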

You can try it yourself:

function letterSort(lang, letters) {
  letters.sort(new Intl.Collator(lang).compare);
  return letters;
}

const kanji = ['南', '北', '打'];
const hiragana = ['ぬ', 'ち', 'よ'];
const katakana = ['シ', 'イ', 'ホ'];
const romaji = ['a', 'c', 'b'];

console.log(letterSort('ja', [...kanji, ...hiragana, ...katakana, ...romaji]))

The result is in line with what we'd expect: the groups of characters are first ordered relative to one another, while more granular sorting occurs only within each group.
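If you do want Japanese text ahead of everything else, as the question asks, one option is to layer a small script check on top of the collator. The following is only a sketch under a few assumptions: `isJapanese` and `japaneseFirst` are hypothetical helper names, the check looks at the first character only, and it relies on Unicode property escapes (`\p{Script=...}`) being supported by the regex engine. The Han script also covers Chinese text, so this detects the script, not the language.

```javascript
// Hypothetical helper: true when the string starts with hiragana, katakana, or a Han character.
const isJapanese = (s) =>
  /^[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]/u.test(s);

// Hypothetical comparator factory: strings in Japanese scripts come first, everything
// else after; within each group the locale-aware collator decides the order.
function japaneseFirst(locale = 'ja') {
  const collator = new Intl.Collator(locale);
  return (a, b) => {
    const aIsJa = isJapanese(a);
    const bIsJa = isJapanese(b);
    if (aIsJa !== bIsJa) return aIsJa ? -1 : 1;
    return collator.compare(a, b);
  };
}

const words = ['a', '手に', '大人', 'b', '学校', '#', '金魚', 'きんぎょ', 'キンギョ'];
const sortedWords = [...words].sort(japaneseFirst());
console.log(sortedWords);
```

The collator's own ordering is left untouched inside each group, so kana and kanji still sort relative to each other as they do by default; only the position of the non-Japanese entries changes.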

Upvotes: 2
