unicodedata.normalize misses one character during conversion

Question

I'm trying to rename files using the script below, but it doesn't catch the typographic apostrophe in "Don’t", which should end up as "Don't". Any ideas on how I can do this?

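import glob
import os
import unicodedata

def remove_accents(s):
    # Decompose each character (NFKD), then drop the combining marks (accents).
    nfkd_form = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

# Walk all .mp3 files below the current directory and rename those that change.
for fname in glob.glob("**/*.mp3", recursive=True):
    new_fname = remove_accents(fname)
    if new_fname != fname:
        try:
            print('renaming non-ascii filename to', new_fname)
            os.rename(fname, new_fname)
        except Exception as e:
            print(e)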

wim · Accepted Answer

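Wrong tool for the job: unicodedata.normalize is not about removing accents at all. It only converts text between equivalent Unicode representations; in the question's script the accents are dropped by the combining() filter, not by normalize itself.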
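
To see why "Don’t" slips through, it may help to compare how an accented letter and the typographic apostrophe behave under NFKD. The following is a small sketch using only the standard library; the helper name `describe` is made up for illustration. The accented "é" splits into a base letter plus a combining mark that the filter can drop, whereas "’" (U+2019) has no decomposition, so there is nothing to strip.

```python
import unicodedata

def describe(ch):
    """Hypothetical helper: show a character's decomposition and NFKD form."""
    return {
        "decomposition": unicodedata.decomposition(ch),
        "nfkd": [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch)],
    }

# "é" decomposes into "e" (U+0065) followed by a combining acute accent (U+0301).
print(describe("\u00e9"))

# "’" has an empty decomposition, so NFKD returns it unchanged.
print(describe("\u2019"))
```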

For down-converting to ASCII, look instead at the third-party unidecode package:

>>> from unidecode import unidecode
>>> unidecode("Don’t")
"Don't"

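If installing a third-party package is not an option, a rough standard-library fallback is to map the typographic punctuation by hand before stripping accents. This is only a sketch under assumptions: the table `PUNCT_TO_ASCII` and the function `strip_accents_and_quotes` are hypothetical names, the mapping covers just a handful of characters, and the result is not guaranteed to be pure ASCII the way unidecode's output is.

```python
import unicodedata

# Hypothetical, deliberately small mapping; extend it with whatever
# punctuation actually appears in your filenames.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (the one in "Don’t")
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
})

def strip_accents_and_quotes(s):
    # Replace the listed punctuation first, then drop combining marks.
    s = s.translate(PUNCT_TO_ASCII)
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents_and_quotes("Don\u2019t"))
```

In the question's loop, this function could stand in for remove_accents; characters outside the table that have no decomposition are left as they are.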