Unicode emphasizes that software should be as forward-compatible as possible, defaulting to treating unassigned characters as if they were private-use code points. This works well in most cases, since most newly assigned characters are unchanged by normalization, case folding, and so on.
However, I want to analyze normalization-"breaking" changes in Unicode: characters whose properties would cause changes when applying NFx, NFKx, casefold, or NFKC_Casefold normalization. I'm not 100% confident in my understanding of the NFC or NFKC algorithms, and I believe there have been some stability policy changes that limit the number of special cases. I'm willing to limit my analysis to Unicode 4, 5, or even 6 if it means not having to deal with special cases.
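For reference, a brute-force check of "does this character change under normalization or case folding" can be done with Python's standard `unicodedata` module. This is only a rough sketch: it reflects whatever Unicode version the running interpreter ships with, not a pinned UCD release.

```python
import sys
import unicodedata

# Flag code points whose string form changes under any normalization
# form or under simple case folding. Skips surrogates and unassigned
# code points. Uses the Unicode data bundled with this Python build.
breaking = []
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    if unicodedata.category(ch) in ("Cs", "Cn"):
        continue
    if (any(unicodedata.normalize(form, ch) != ch
            for form in ("NFC", "NFD", "NFKC", "NFKD"))
            or ch.casefold() != ch):
        breaking.append(cp)

print(f"{len(breaking)} code points change under some normalization/casefold")
```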
My initial stab at this parses the XML Unicode Character Database and selects code points based on the canonical combining class (`ccc != 0`), the NFx quick-check properties (`NFC_QC != 'Y'`, `NFD_QC != 'Y'`, etc.), and the casefolding/NFKC_Casefold properties (`CWKCF = 'Y'` or `CWCF = 'Y'`).
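A sketch of that selection, assuming the `ucd.all.flat.xml` file from unicode.org and the attribute names defined by the UAX #42 schema (`ccc`, `NFC_QC`/`NFD_QC`/`NFKC_QC`/`NFKD_QC`, `CWCF`, `CWKCF`):

```python
import xml.etree.ElementTree as ET

# Assumes ucd.all.flat.xml from https://www.unicode.org/Public/UCD/latest/ucdxml/
# In the flat file every <char> element carries its full attribute set,
# so no group inheritance needs to be resolved.
NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"
QC_PROPS = ("NFC_QC", "NFD_QC", "NFKC_QC", "NFKD_QC")

def interesting(attrs):
    return (attrs.get("ccc", "0") != "0"
            or any(attrs.get(p, "Y") != "Y" for p in QC_PROPS)
            or attrs.get("CWKCF") == "Y"
            or attrs.get("CWCF") == "Y")

hits = []
root = ET.parse("ucd.all.flat.xml").getroot()
for char in root.iter(f"{NS}char"):
    if interesting(char.attrib):
        # Single code points carry "cp"; ranges carry "first-cp"/"last-cp".
        cp = char.attrib.get("cp") or char.attrib.get("first-cp")
        hits.append(cp)

print(f"{len(hits)} entries match the ccc/quick-check/casefold criteria")
```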
Is this the best approach, or should I just be looking at the decomposition mapping and type?
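For comparison, the decomposition-based alternative would select on the `dt`/`dm` attributes of the same tree instead (again per the UAX #42 schema, where `dt` is Decomposition_Type and `dm` is Decomposition_Mapping, with `#` standing for the code point itself):

```python
# dt == "none" means no decomposition; dt == "can" marks a canonical
# decomposition (NFD-relevant); any other value marks a compatibility
# decomposition (NFKD-relevant).
def decomposes(attrs, canonical_only=False):
    dt = attrs.get("dt", "none")
    if canonical_only:
        return dt == "can"
    return dt != "none"
```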