Gromit

Reputation: 63

unicodedata.normalize misses one character during conversion

I'm trying to rename files using the script below, but it does not convert the curly apostrophe in "Don’t", which should end up as "Don't". Any ideas on how I can do this?

import glob
import os
import unicodedata

def remove_accents(s):
    # Decompose each character (e.g. "é" becomes "e" + a combining accent),
    # then drop the combining marks.
    nfkd_form = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

for fname in glob.glob("**/*.mp3", recursive=True):
    new_fname = remove_accents(fname)
    if new_fname != fname:
        try:
            print('renaming non-ascii filename to', new_fname)
            os.rename(fname, new_fname)
        except Exception as e:
            print(e)

Upvotes: 1

Views: 489

Answers (1)

wim

Reputation: 362786

Wrong tool for the job - unicodedata.normalize is not about removing accents at all.
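
To see why the apostrophe slips through, here is a small check using only the standard library. It suggests that "’" (U+2019) has no decomposition and is not a combining mark, so the NFKD-plus-filter approach from the question has nothing to strip. The variable names are just for illustration.

```python
import unicodedata

curly = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the apostrophe in "Don’t"
text = "Don" + curly + "t"

# NFKD only rewrites characters that have a decomposition mapping.
unchanged = unicodedata.normalize("NFKD", text) == text

print(unicodedata.name(curly))                 # official character name
print(repr(unicodedata.decomposition(curly)))  # empty string: nothing to decompose
print(unicodedata.combining(curly))            # 0: not a combining mark
print(unchanged)

# By contrast, an accented letter splits into a base letter plus a combining mark,
# which is the only case the filter in the question can handle.
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print([unicodedata.name(c) for c in decomposed])
```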

For down-converting to ASCII, look instead at the third-party unidecode package:

>>> from unidecode import unidecode
>>> unidecode("Don’t")
"Don't"
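
If installing a package is not an option, a rough standard-library sketch is to translate the common "smart" punctuation explicitly before stripping accents. This is not how unidecode works internally, and it covers far fewer characters; PUNCT_MAP and to_ascii_name are hypothetical names, and the mapping is an assumption about which characters show up in your filenames.

```python
import unicodedata

# Hypothetical, deliberately small mapping; extend it for your own files.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (the one in "Don’t")
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

def to_ascii_name(s):
    # First replace punctuation that has no Unicode decomposition,
    # then decompose and drop combining marks as in the question.
    s = s.translate(PUNCT_MAP)
    nfkd_form = unicodedata.normalize("NFKD", s)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

print(to_ascii_name("Don\u2019t Stop"))
print(to_ascii_name("Caf\u00e9"))
```

Characters that are neither in the mapping nor decomposable pass through unchanged, so the result is not guaranteed to be pure ASCII.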

Upvotes: 3
