Ivan

Reputation: 538

In which cases does the normalize('NFKC') method work?

I tried normalize('NFKC') with different characters, but it never did what I expected. NFC, on the other hand, behaves as I expect: whenever possible, normalize('NFC') replaces multiple code points with a single one. For example:

let t1 = `\u00F4`; //ô
let t2 = `\u006F\u0302`; //ô
console.log(t2.normalize('NFC') == t1); //true

And here's an example with NFKC that doesn't work:

let s1 = '\uFB00'; //"ﬀ" (ligature, a single code point)
let s2 = '\u0066\u0066'; //"ff" (two separate letters)
console.log(s2.normalize('NFKC') == s1); //false

I had thought that NFKC replaces multiple code points with a single one that represents the compatibility character. Put simply, I thought NFKC would replace \u0066\u0066 with \uFB00.

If NFKC doesn't work like that, then... how does it work?

Upvotes: 3

Views: 3017

Answers (1)

Ivan

Reputation: 538

The thing is that NFKC (like NFKD) applies both compatibility and canonical-equivalence normalization.

Unicode

The type of full decomposition chosen depends on which Unicode Normalization Form is involved. For NFC or NFD, one does a full canonical decomposition, which makes use of only canonical Decomposition_Mapping values. For NFKC or NFKD, one does a full compatibility decomposition, which makes use of canonical and compatibility Decomposition_Mapping values.

That makes sense because, as MDN says:

All canonically equivalent sequences are also compatible, but not vice versa.
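
One way to see the difference between the two kinds of decomposition is to compare NFD and NFKD on the same inputs. This is only a minimal sketch; the variable names are illustrative and not from any spec.

```javascript
const ffLigature = '\uFB00';   // "ﬀ", a compatibility character
const oCircumflex = '\u00F4';  // "ô", a canonical composite

// Canonical decomposition leaves the ligature untouched...
const ligNfd = ffLigature.normalize('NFD');
// ...while compatibility decomposition expands it to two plain letters.
const ligNfkd = ffLigature.normalize('NFKD');

// Both forms apply the canonical mapping to "ô".
const oNfd = oCircumflex.normalize('NFD');
const oNfkd = oCircumflex.normalize('NFKD');

console.log(ligNfd === ffLigature);  // true
console.log(ligNfkd === 'ff');       // true
console.log(oNfd === 'o\u0302');     // true
console.log(oNfkd === oNfd);         // true
```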

It's also worth noting that NFKC handles the two kinds of normalization differently. The canonical part works the same way as in NFC. For example:

//"ô" (U+00F4) -> "o" (U+006F) + " ̂" (U+0302) -> "ô" (U+00F4)
let c1 = `\u006F\u0302`; //ô
console.log(c1.normalize('NFKC').length); //1

The compatibility part works differently, though. The spec says:

Normalization Form KC does not attempt to map character sequences to compatibility composites. For example, a compatibility composition of “office” does not produce “o\uFB03ce”, even though “\uFB03” is a character that is the compatibility equivalent of the sequence of three characters “ffi”. In other words, the composition phase of NFC and NFKC are the same—only their decomposition phase differs, with NFKC applying compatibility decompositions.

For example:

//"ﬀ" (U+FB00) -> "f" (U+0066) + "f" (U+0066); the pair is not recomposed into U+FB00
let c2 = '\u0066\u0066'; //ff
console.log(c2.normalize('NFKC').length); //2
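
The "office" case from the quoted spec text can be checked directly. The sketch below (with made-up variable names) suggests that NFKC expands the ligature U+FB03 into plain letters but never goes the other way:

```javascript
const withLigature = 'o\uFB03ce'; // "oﬃce", 4 code points
const plainOffice = 'office';     // 6 code points

// The ligature is decomposed into "f" + "f" + "i".
const fromLigature = withLigature.normalize('NFKC');
// Plain letters are left alone; no ligature is produced.
const fromPlain = plainOffice.normalize('NFKC');

console.log(fromLigature === plainOffice);  // true
console.log(fromPlain === plainOffice);     // true
console.log(fromPlain.includes('\uFB03'));  // false
```

So NFKC is useful for making both spellings compare equal, not for producing compatibility characters.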

Upvotes: 6
