How to handle length changes due to normalization (NFKC for my use-case)?

Question

NFKC normalization does not always map characters one-to-one. A ligature such as 'ﬁ' expands into 'fi', while some Japanese/Chinese character sequences combine into a single character. I need a way to map offsets between the original string and the normalized string. Is there a library or method that does this accurately?

Approximating the mapping by anchoring on surrounding characters that normalization leaves unchanged, such as English letters and whitespace, helps, but it is not accurate enough.
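For reference, here is a minimal sketch of the kind of offset mapping I mean, using only Python's standard `unicodedata` module. It splits the original string into small segments that appear to normalize independently, then records which original slice each normalized character came from. The names `nfkc_with_alignment` and `to_original_span` are my own and do not come from any library. The boundary rule (break only before a character whose NFKC form starts with a non-combining character and does not interact with the current segment) is an assumption that I believe holds for common cases; the function checks the result against a full normalization and raises an error if they differ.

```python
import unicodedata


def nfkc_with_alignment(text):
    """Return (normalized, spans); spans[k] is the (start, end) slice of
    `text` that produced normalized[k]."""
    def nfkc(s):
        return unicodedata.normalize("NFKC", s)

    # Split `text` into segments that can be normalized independently.
    segments = []  # (start, end) index pairs into `text`
    start = 0
    for i in range(1, len(text)):
        head = nfkc(text[i])
        seg = text[start:i]
        # Assumed boundary rule: the next character must normalize to
        # something that begins with a starter (combining class 0) ...
        starts_with_starter = bool(head) and unicodedata.combining(head[0]) == 0
        # ... and must not compose or reorder with the current segment.
        independent = nfkc(seg + text[i]) == nfkc(seg) + head
        if starts_with_starter and independent:
            segments.append((start, i))
            start = i
    if text:
        segments.append((start, len(text)))

    parts = []
    spans = []
    for s, e in segments:
        part = nfkc(text[s:e])
        parts.append(part)
        # Every normalized character in a segment maps to the whole segment.
        spans.extend([(s, e)] * len(part))
    normalized = "".join(parts)

    # Sanity check: segment-wise normalization must equal the real result.
    if normalized != nfkc(text):
        raise ValueError("segmentation changed the NFKC result")
    return normalized, spans


def to_original_span(spans, start, end):
    """Map the non-empty normalized slice [start, end) back to `text`."""
    return spans[start][0], spans[end - 1][1]
```

For example, with `text = "\ufb01ne \uff76\uff9e x"` (the 'ﬁ' ligature followed by a half-width katakana pair), `nfkc_with_alignment(text)` returns a normalized string in which both 'f' and 'i' map back to original slice `(0, 1)`, and the single combined katakana character maps back to `(4, 6)`. The mapping is only as fine-grained as one segment, so offsets that fall inside an expanded or combined character resolve to the whole original cluster.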

Answers (0)
