Dervin Thunk

Reputation: 20129

Information on rationale for Unicode codepoint sorting?

I was looking into https://github.com/JuliaLang/utf8proc/blob/master/utf8proc.c#L397, in particular this snippet:

if (property1->combining_class > property2->combining_class &&
    property2->combining_class > 0) {
  buffer[pos] = uc2;
  buffer[pos+1] = uc1;

Obviously, there's some "sorting" going on, but I cannot find the rationale for that sorting on the Unicode site (I simply don't know how to search for it).

Is there some rationale for why certain properties come before others, or is it simply one "canonical" ordering?

Upvotes: 0

Views: 78

Answers (1)

一二三

Reputation: 21249

This is the "Canonical Ordering Algorithm" that defines a single order for the combining marks that follow a character. The problem is that characters with multiple combining marks in different positions (e.g., above and below the character) can be specified in multiple ways:

ṩ: U+0073 U+0323 U+0307

ṩ: U+0073 U+0307 U+0323

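When normalising text, the canonical ordering algorithm makes sure that ṩ can only appear in the first order: the dot below before the dot above. The order comes from each mark's Canonical_Combining_Class value: U+0323 has class 220 and U+0307 has class 230, and adjacent marks are swapped until their classes are in ascending order, which is the comparison in your snippet. Characters with class 0 (starters) are never moved, hence the check that property2->combining_class > 0, and marks with equal classes keep their original relative order. The particular class values (below before above) are somewhat arbitrary given the number of ways that characters can combine, but they follow the order in this table, which appears to be generally bottom-to-top, left-to-right.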

The complete specification of the algorithm is given in section 3.11 of the Unicode Standard.
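As a rough illustration of the reordering step, here is a minimal sketch modelled loosely on the snippet in the question. It is not utf8proc's actual code: combining_class() is a hypothetical stand-in that only knows the three combining marks listed in it (a real implementation looks the class up in the Unicode Character Database), and canonical_reorder() is a name made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup: returns the Canonical_Combining_Class for the few
   code points used in this example. Anything not listed is treated as a
   starter (class 0), which is an assumption made for brevity. */
int combining_class(uint32_t cp) {
    switch (cp) {
    case 0x0323: return 220; /* COMBINING DOT BELOW */
    case 0x0307: return 230; /* COMBINING DOT ABOVE */
    case 0x0301: return 230; /* COMBINING ACUTE ACCENT */
    default:     return 0;
    }
}

/* Sorts each run of combining marks into ascending class order by
   swapping adjacent pairs. Starters (class 0) are never moved, and marks
   with equal classes keep their relative order. */
void canonical_reorder(uint32_t *buffer, size_t len) {
    assert(buffer != NULL || len == 0);
    size_t pos = 0;
    while (pos + 1 < len) {
        uint32_t uc1 = buffer[pos];
        uint32_t uc2 = buffer[pos + 1];
        int c1 = combining_class(uc1);
        int c2 = combining_class(uc2);
        if (c1 > c2 && c2 > 0) {
            buffer[pos] = uc2;
            buffer[pos + 1] = uc1;
            /* Step back so the moved mark is compared with its new
               left-hand neighbour. */
            if (pos > 0) pos--; else pos++;
        } else {
            pos++;
        }
    }
}
```

Calling canonical_reorder() on the sequence U+0073 U+0307 U+0323 swaps the last two code points, while U+0073 U+0323 U+0307 is left unchanged, so both spellings of ṩ end up as the same sequence.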

Upvotes: 1
