Reputation: 747
I'm trying to take foreign-language text and output a human-readable, filename-safe equivalent. After looking around, it seems like the best option is unicodedata.normalize(), but I can't get it to work. I've tried the exact code from some answers here and elsewhere, but it keeps giving me this error. The only success I've had was when I ran:
>>> unicodedata.normalize('NFD', '\u00C7')
'C\u0327'
But every other time, I get an error. Here's the code I've tried:
unicodedata.normalize('NFKD', u'\u2460')  # error, not sure why; looks the same as the call above
s = 'ذهب الرجل'
unicodedata.normalize('NKFC', s)  # error
unicodedata.normalize('NKFD', 'ñ')  # error
Specifically, the error I get is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid normalization form
I don't understand why this isn't working. All of these are strings, which means they are Unicode in Python 3. I tried encoding them with .encode(), but then normalize() said it only takes str arguments (see the demo below), so I know that can't be it. I'm seriously at a loss, because even code I'm copying from here seems to error out. What's going on here?
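For reference, here's the different failure I get when I pass bytes instead of str (the exact message may vary by Python version):

>>> import unicodedata
>>> unicodedata.normalize('NFC', 'ñ'.encode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: normalize() argument 2 must be str, not bytes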
Upvotes: 3
Views: 2772
Reputation: 120628
Looking at unicodedata.c, the only way you can get that error is if you enter an invalid form string. The valid values are "NFC", "NFKC", "NFD", and "NFKD", but you seem to be using values with the "F" and "K" switched around:
>>> import unicodedata
>>>
>>> unicodedata.normalize('NKFD', 'ñ')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid normalization form
>>>
>>> unicodedata.normalize('NFKD', 'ñ')
'ñ'
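
Once the form string is fixed, NFKD is also the right tool for the filename-safe output you're after: compatibility decomposition turns characters like '\u2460' into a plain '1' and splits accents into combining marks that an ASCII encode can drop. Here's a minimal sketch of that approach (the filename_safe helper and its character whitelist are just an illustration, not anything from unicodedata itself; note that scripts with no ASCII decomposition, like your Arabic sample, come out empty, so you'd need a transliteration library for those):

import re
import unicodedata

def filename_safe(text):
    # NFKD splits accented characters into base + combining mark and
    # maps compatibility characters (e.g. '\u2460' -> '1').
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop everything that has no ASCII representation.
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Collapse any remaining unsafe characters into underscores.
    return re.sub(r'[^A-Za-z0-9._-]+', '_', ascii_only).strip('_')

For example:

>>> filename_safe('Çñ \u2460.txt')
'Cn_1.txt'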
Upvotes: 7