Reputation: 1903
I want to check if a string is already in NFC form. Currently I do:
unicodedata.normalize('NFC', s) == s
I am doing this for a large number of strings, so I would like to be efficient. The above method seems wasteful. It converts to NFC, and then does a string comparison.
Is there a more efficient way to do it? I have considered:
len(unicodedata.normalize('NFC', s)) == len(s)
This avoids the string comparison, but I am not sure it is always correct. It only works if NFC normalization always changes the length of a non-NFC string. Is that a valid assumption?
Any other ideas?
Upvotes: 5
Views: 2018
Reputation: 168
Since Python 3.8, the unicodedata module exposes the needed check. Quote from the Python docs:
unicodedata.is_normalized(form, unistr)
Return whether the Unicode string unistr is in the normal form 'form'. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
New in version 3.8.
I wanted everything to be in NFC, but checking for NFD (so I could convert only those strings) did not work: all of my NFC strings passed the NFD check! My solution was to test whether a string is not NFC and, if so, do the conversion.
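A minimal sketch of that "convert only if needed" approach, assuming Python 3.8 or later. The helper name `to_nfc` and the sample strings are made up for illustration. It also shows one likely reason an NFD check can pass for NFC strings: text with no composed or combining characters (plain ASCII, for instance) is in both forms at once, so testing for NFD does not tell you a string is "not NFC".

```python
import unicodedata


def to_nfc(s: str) -> str:
    """Return s in NFC, normalising only when the check says it is needed."""
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)


ascii_text = "cafe"        # no accents: already both NFC and NFD
composed = "caf\u00e9"     # precomposed e-acute: NFC, but not NFD
decomposed = "cafe\u0301"  # e + combining acute: NFD, but not NFC

for label, text in [("ascii", ascii_text),
                    ("composed", composed),
                    ("decomposed", decomposed)]:
    print(label,
          "NFC:", unicodedata.is_normalized("NFC", text),
          "NFD:", unicodedata.is_normalized("NFD", text))
```

Testing directly for "is NFC" and converting otherwise, as described above, avoids relying on the two forms being mutually exclusive.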
Upvotes: 0
Reputation: 21249
Normalising doesn't necessarily change the length of a string. For example, 'Ω' (U+2126, OHM SIGN) becomes 'Ω' (U+03A9, GREEK CAPITAL LETTER OMEGA) after NFC.
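A quick way to see this counterexample for yourself (the variable names are only for illustration): the length comparison from the question reports "unchanged", while the string comparison correctly reports a difference.

```python
import unicodedata

ohm_sign = "\u2126"  # OHM SIGN, not in NFC form
normalized = unicodedata.normalize("NFC", ohm_sign)

# Same length before and after, so a length-only test would wrongly
# conclude that the original string was already in NFC.
same_length = len(normalized) == len(ohm_sign)
same_string = normalized == ohm_sign

print(f"U+{ord(normalized):04X}", same_length, same_string)
```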
There is a normalisation "quick check" property in the Unicode database to test whether a character is already normalised, but unfortunately Python's unicodedata module doesn't expose it. However, unicodedata.normalize() does use this property to avoid doing any extra work if the string is already normalised; it simply returns the input string.
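One way to observe this is an identity test on the result. This is only a sketch: returning the very same object is a CPython implementation detail rather than a documented guarantee, so treat the output as informative, not as something to rely on. The helper name `same_object_after_nfc` is hypothetical.

```python
import unicodedata


def same_object_after_nfc(s: str) -> bool:
    """True if normalize() handed back the input object itself."""
    return unicodedata.normalize("NFC", s) is s


already_nfc = "caf\u00e9"  # precomposed e-acute
not_nfc = "cafe\u0301"     # e followed by a combining acute accent

print(same_object_after_nfc(already_nfc))
print(same_object_after_nfc(not_nfc))
```

Where the input object does come back unchanged, the `normalize('NFC', s) == s` test from the question costs little more than the quick check itself for strings that are already normalised.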
To access this property, you will either need to compile a table yourself from the Unicode Character Database, or use a broader Unicode library with Python bindings (like PyICU).
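As a rough sketch of the "compile a table yourself" route: the Unicode Character Database file DerivedNormalizationProps.txt lists the code points whose NFC_QC value is N (no) or M (maybe); everything else is Y. The few sample lines below are an illustrative excerpt written in that file's format, not the full data, and the function names `load_nfc_qc` and `nfc_quick_check` are my own. The check loosely follows the quick-check procedure described in UAX #15, using unicodedata.combining() for the canonical combining class.

```python
import unicodedata

# Illustrative excerpt only; a real table would be built from the whole file.
SAMPLE_LINES = """
0340..0341    ; NFC_QC; N # Mn   [2] COMBINING GRAVE TONE MARK..COMBINING ACUTE TONE MARK
2126          ; NFC_QC; N # L&       OHM SIGN
0300..0304    ; NFC_QC; M # Mn   [5] COMBINING GRAVE ACCENT..COMBINING MACRON
""".splitlines()


def load_nfc_qc(lines):
    """Map code point -> 'N' or 'M'; unlisted code points count as 'Y'."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3 or fields[1] != "NFC_QC":
            continue  # some other property
        first, _, last = fields[0].partition("..")
        lo, hi = int(first, 16), int(last or first, 16)
        for cp in range(lo, hi + 1):
            table[cp] = fields[2]
    return table


def nfc_quick_check(s, table):
    """Return 'YES', 'NO' or 'MAYBE' for whether s is in NFC."""
    last_ccc = 0
    result = "YES"
    for ch in s:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and last_ccc > ccc:
            return "NO"  # combining marks out of canonical order
        value = table.get(ord(ch), "Y")
        if value == "N":
            return "NO"
        if value == "M":
            result = "MAYBE"
        last_ccc = ccc
    return result


table = load_nfc_qc(SAMPLE_LINES)
print(nfc_quick_check("abc", table))
print(nfc_quick_check("\u2126", table))
print(nfc_quick_check("e\u0301", table))
```

A 'MAYBE' result is inconclusive, so in that case you would still fall back to comparing the string against unicodedata.normalize('NFC', s).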
Upvotes: 5