mingxiao

Reputation: 1802

Set of Unicode characters that do not have the same NFD and NFC encoding

What is the set of Unicode characters whose NFC and NFD encodings differ?

For example, 日本 is u'\u65e5\u672c' in both NFD and NFC.

However, のご賛同をいただき ました differs between the two forms:

in NFD: u'\u306e\u3053\u3099\u8cdb\u540c\u3092\u3044\u305f\u305f\u3099\u304d \u307e\u3057\u305f'

in NFC: u'\u306e\u3054\u8cdb\u540c\u3092\u3044\u305f\u3060\u304d \u307e\u3057\u305f'

(Definitions of NFD and NFC: https://en.wikipedia.org/wiki/Unicode_normalization#Normal_forms)

Upvotes: 0

Views: 163

Answers (1)

nwellnhof

Reputation: 33638

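NFC is performed by first decomposing a string canonically (which yields NFD), then recomposing some character sequences. So the single characters for which NFC and NFD differ are, for the most part, the characters that have a canonical decomposition mapping in the UCD (including the algorithmically decomposed Hangul syllables) and are not excluded from composition. These characters are also called primary composites. A few composition-excluded characters qualify as well, because their full decomposition recomposes to a different character: U+212B ANGSTROM SIGN has the NFD form U+0041 U+030A and the NFC form U+00C5.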
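If you just want the concrete set for the Unicode version your runtime ships with, a brute-force check avoids parsing the UCD. The following is a minimal sketch in Python 3 using the standard `unicodedata` module; the function names `differs` and `chars_with_distinct_forms` are made up for illustration, and the result depends on the Unicode version bundled with your interpreter.

```python
import sys
import unicodedata


def differs(s):
    """Return True if the NFC and NFD forms of s are not identical."""
    return unicodedata.normalize("NFC", s) != unicodedata.normalize("NFD", s)


def chars_with_distinct_forms(code_points=range(sys.maxunicode + 1)):
    """Yield every single character in code_points whose NFC and NFD differ."""
    for cp in code_points:
        # Surrogate code points are not characters; skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if differs(ch):
            yield ch


if __name__ == "__main__":
    # Scanning the whole code space takes a few seconds.
    result = list(chars_with_distinct_forms())
    print(len(result), "characters have different NFC and NFD forms")
```

Pass a smaller range, such as `range(0x3040, 0x30A0)` for hiragana, to inspect a single block.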

Note that this only applies to single characters. If you're considering sequences of multiple characters, things get a lot more complicated. A sequence of two characters, each of which has identical NFC and NFD forms on its own, can still have different NFC and NFD forms as a whole. In your example, U+3053 (こ) and U+3099 (the combining voiced sound mark) are each unchanged by both forms, but the sequence U+3053 U+3099 composes to U+3054 (ご) under NFC.
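The sequence case can be checked with the code points from the question. Again a small Python 3 sketch; the helpers `nfc` and `nfd` are just shorthand wrappers around `unicodedata.normalize`, not part of any library.

```python
import unicodedata


def nfc(s):
    return unicodedata.normalize("NFC", s)


def nfd(s):
    return unicodedata.normalize("NFD", s)


ko = "\u3053"        # HIRAGANA LETTER KO
dakuten = "\u3099"   # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

# Each character on its own is left unchanged by both forms.
separately_stable = all(nfc(c) == nfd(c) == c for c in (ko, dakuten))

# Together, NFC composes the pair while NFD leaves it as two code points.
pair = ko + dakuten
composed = nfc(pair)
decomposed = nfd(pair)

print(separately_stable, composed != decomposed)
```

So a per-character lookup against the set of single characters is not enough to decide whether a whole string changes between NFC and NFD; you have to normalize the string itself.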

Upvotes: 1
