Python unicode os.listdir() not returning results from API

Question

I'm using TMDb to look up media based on filename. Most of the time this works fine, except when I use os.listdir() to search for files with Unicode chars in the name. As far as I can tell, TMDb looks for results in unicode and returns the response in unicode as well.

Take for example a cover art file for Amélie:

Amélie.jpg

A simple controlled experiment shows that no matter what I try, it works when using typed Unicode strings, but not when using os.listdir().

# -*- coding: utf-8 -*- 

import os
import tmdbsimple as tmdb

tmdb.API_KEY = ''

print os.listdir('/media/artwork/')
print os.listdir(u'/media/artwork/')

print('\nstr controlled test')
s = tmdb.Search()
s.movie(query='Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nunicode controlled test')
s = tmdb.Search()
s.movie(query=u'Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nstr listdir')
for file in os.listdir('/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

print('\nunicode listdir')
for file in os.listdir(u'/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

Outputs:

['Ame\xcc\x81lie.jpg']
[u'Ame\u0301lie.jpg']

str controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

unicode controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

str listdir
('Results', 0)

unicode listdir
('Results', 0)

So why does the typed string work consistently, whether str or unicode, while the filename pulled from the file system does not?

I've tried:

encode('utf-8') and decode('utf-8') in a myriad of combinations
using u'' prefix in all the file loading
reloading sys with utf-8 encoding
I came across a post from Martijn Pieters about Mac OS handling Unicode differently, but I can't seem to find it again
isinstance(file, str) (surprise, it's not unicode!)

So... how can I get a folder enumeration to work with unicode chars, OR is there a proper way I can convert these ascii filenames to unicode without the dreaded ordinal out of range error?

Mark Tolonen · Accepted Answer

The difference is that your file system is using decomposed Unicode characters (NFD). If you normalize the filenames it returns to composed Unicode characters (NFC), the search will work. \xe9 is the single Unicode character é, while e\u0301 is an ASCII e followed by a combining acute accent:

>>> import unicodedata as ud
>>> u'Am\xe9lie' == ud.normalize('NFC',u'Ame\u0301lie')
True
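
To make the difference visible, a minimal sketch along these lines lists the code points in each form. It uses only the standard unicodedata module; the names composed, decomposed and code_points are just for illustration, and the u'' literals are there so it should behave the same on Python 2 and Python 3.

```python
import unicodedata as ud

composed = u'Am\xe9lie'       # the typed string, composed form (NFC)
decomposed = u'Ame\u0301lie'  # what os.listdir() returned in the question (NFD)

def code_points(text):
    """Return (hex code point, Unicode name) for each character in text."""
    return [('U+%04X' % ord(ch), ud.name(ch)) for ch in text]

# Both strings render identically, but they are different code point sequences.
print(composed == decomposed)
print(len(composed), len(decomposed))  # 6 vs 7 characters
print(code_points(composed)[2])        # the single precomposed character
print(code_points(decomposed)[2:4])    # plain 'e' plus the combining accent

# Normalizing either string to the other form makes them compare equal.
print(ud.normalize('NFC', decomposed) == composed)
print(ud.normalize('NFD', composed) == decomposed)
```

A search service that compares against the composed form will not match the seven-character variant, which would explain the zero results.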

So use:

import os
import unicodedata as ud
import tmdbsimple as tmdb

print('\nunicode listdir')
for filename in os.listdir(u'/media/artwork/'):
    # Convert the decomposed (NFD) name to composed (NFC) before searching.
    nfilename = ud.normalize('NFC', filename)
    s = tmdb.Search()
    s.movie(query=os.path.splitext(nfilename)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)
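
If filenames can arrive either as byte strings (from os.listdir('...')) or as unicode (from os.listdir(u'...')), the normalization can be wrapped in a small helper. The following is a sketch, not part of tmdbsimple: title_from_filename is a hypothetical name, and it assumes the file system encodes names as UTF-8, which is consistent with the \xcc\x81 bytes in the question's output.

```python
import os
import unicodedata as ud

def title_from_filename(filename, encoding='utf-8'):
    """Return the NFC-normalized base name of a file, without its extension.

    Hypothetical helper: `encoding` is an assumption about how the file
    system encodes names; it is only used when `filename` is a byte string.
    """
    if isinstance(filename, bytes):
        # On Python 2, os.listdir('...') returns byte strings; decode them first.
        filename = filename.decode(encoding)
    base = os.path.splitext(filename)[0]
    return ud.normalize('NFC', base)

# Decomposed names, as unicode or as UTF-8 bytes, both map to the composed form.
print(title_from_filename(u'Ame\u0301lie.jpg') == u'Am\xe9lie')
print(title_from_filename(b'Ame\xcc\x81lie.jpg') == u'Am\xe9lie')
```

The result can then be passed straight to the search, as in s.movie(query=title_from_filename(filename)).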
