Reputation: 3516

Why does NFKC normalize of all digits not work?

In JavaScript I am using NFKC normalization via String.prototype.normalize to normalize fullwidth to standard ASCII halfwidth characters.

'１'.normalize('NFKC') === '1'
> true

However, less common digits such as ૫, which is the digit 5 in Gujarati, are not normalized.

'૫'.normalize('NFKC') === '5'
> false

What am I missing?

Upvotes: 1

Answers (3)

yonarp

Reputation: 39

The NFKC form you are using stands for Compatibility Decomposition followed by Canonical Composition. In plain English, it first breaks characters down into simpler, more commonly used ones and then recombines them to find the equivalent simpler character. For example, 𝟘 -> 0 and ﬁ -> fi (ﬁ is code point 64257). It does not convert to ASCII in general: for example, ख़ (2393) -> ख़ (the two-code-point sequence [2326, 2364]).
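The examples above can be checked directly in a console. This is only a minimal sketch: the helper name `codePoints` and the variable `nfkcReport` are made up for illustration, and the results assume an engine with standard Unicode data (for example, a regular Node.js build).

```javascript
// List the code points of a string as decimal numbers.
const codePoints = (s) => Array.from(s, (ch) => ch.codePointAt(0));

const samples = [
  '\u{1D7D8}', // MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO
  '\uFB01',    // LATIN SMALL LIGATURE FI (64257)
  '\u0959',    // DEVANAGARI LETTER KHHA (2393)
  '\u0AEB',    // GUJARATI DIGIT FIVE
];

// Record the code points before and after NFKC for each sample.
const nfkcReport = samples.map((s) => ({
  before: codePoints(s),
  after: codePoints(s.normalize('NFKC')),
}));

for (const { before, after } of nfkcReport) {
  console.log(`[${before.join(', ')}] -> [${after.join(', ')}]`);
}
```

The first two samples become ASCII ([48] and [102, 105]), the Devanagari letter becomes [2326, 2364], and the Gujarati digit stays [2795], because it has no decomposition mapping at all.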

Reference: https://unicode.org/reports/tr15/#Norm_Forms

For simpler understanding: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c

Upvotes: 1

Giacomo Catenazzi

Reputation: 9533

You are looking at the wrong problem.

Unicode's main purpose is to encode characters (without losing information). Fonts and other programs should be able to interpret those characters and produce a glyph, based on the combination of code points, nearby characters, and other characteristics outside the code points, such as language, epoch, and font characteristics (script versus non-script, uppercase, italic, and so on, all of which change how characters are combined, which ligatures are used, and the form of the glyph).

There are two main kinds of normalization (canonical and compatibility), each with two variants: decomposed, and composed where possible. Canonical normalization removes unneeded characters (repetition) and orders combining characters in a standard way. Compatibility normalization removes "compatibility characters": characters that exist in Unicode only so that no information is lost when converting to and from other character sets.

Some digits (such as the small 2 used as an exponent) have the ordinary digit as their compatibility character (this is a formatting question, and Unicode is not about formatting). In the other cases, digits from different scripts should be kept as different characters.

That was about normalization.

But what you want is the numeric value of a Unicode character (warning: it could depend on other characters, position, etc.).

The Unicode database also provides such a property.

With JavaScript, you may use the unicode-properties JavaScript package, which also provides the function getNumericValue(codePoint). This package seems to use efficient compression of the database, so I don't know how fast it is. The database is huge.
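If pulling in a package is not an option, decimal digits specifically can be handled with built-in features only. The following is a rough sketch, not the package's method: the function names `decimalDigitValue` and `toAsciiDigits` are made up here, and the approach relies on the assumption (stated in the Unicode Standard for General_Category Nd) that decimal digits are encoded in contiguous runs of ten, in ascending order starting at zero. It does not cover other numeric characters such as fractions or Roman numerals.

```javascript
// True if the code point has General_Category Nd (decimal digit).
const isDecimalDigit = (cp) => /^\p{Nd}$/u.test(String.fromCodePoint(cp));

// Value 0-9 of a decimal digit character, or null for anything else.
// Assumption: Nd characters come in contiguous runs of ten starting at zero,
// so the distance to the start of the run, modulo 10, is the digit value.
function decimalDigitValue(ch) {
  const cp = ch.codePointAt(0);
  if (!isDecimalDigit(cp)) return null;
  let offset = 0;
  while (cp - offset - 1 >= 0 && isDecimalDigit(cp - offset - 1)) {
    offset++;
  }
  return offset % 10;
}

// Replace every decimal digit in any script with its ASCII counterpart.
function toAsciiDigits(text) {
  return text.replace(/\p{Nd}/gu, (ch) => String(decimalDigitValue(ch)));
}

console.log(decimalDigitValue('\u0AEB'));         // GUJARATI DIGIT FIVE
console.log(toAsciiDigits('\u0AE7\u0AE8\u0AE9')); // Gujarati digits one, two, three
```

The modulo is there for blocks such as the mathematical digits, where several sets of ten sit back to back. Results depend on the Unicode version the JavaScript engine ships with.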

Upvotes: 1

CharlotteBuff

Reputation: 4439

Unicode normalisation is meant for characters that are variants of each other, not for every set of characters that might have similar meanings.

The character ‘１’ (FULLWIDTH DIGIT ONE) is essentially just the character ‘1’ (DIGIT ONE) with slightly different styling and would not have been encoded if it was not necessary for compatibility. They are – in some contexts – completely interchangeable, so the former was assigned a decomposition mapping to the latter. The character ‘૫’ (GUJARATI DIGIT FIVE) does not have a decomposition mapping because it is not a variant of any other character; it is its own distinct thing.

You can consult the Unicode Character Database to see which characters decompose and which (i.e. most of them) don’t. The link to the tool you posted as part of your question shows you for example that ૫ does not change under any form of Unicode normalisation.
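The same check can be approximated in code by comparing a string with each of its normalized forms. This is just an illustrative sketch (the name `formsThatChange` is invented here), and it reflects whatever Unicode data the JavaScript engine ships with rather than the database itself.

```javascript
const FORMS = ['NFC', 'NFD', 'NFKC', 'NFKD'];

// Return the normalization forms under which `text` changes.
function formsThatChange(text) {
  return FORMS.filter((form) => text.normalize(form) !== text);
}

console.log(formsThatChange('\uFF11')); // FULLWIDTH DIGIT ONE
console.log(formsThatChange('\u0AEB')); // GUJARATI DIGIT FIVE
```

FULLWIDTH DIGIT ONE changes only under the two compatibility forms (NFKC and NFKD), while GUJARATI DIGIT FIVE yields an empty list.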

Upvotes: 1
