brandonscript

Reputation: 72905

Python unicode os.listdir() not returning results from API

I'm using TMDb to look up media based on filename. Most of the time this works fine, except when I use os.listdir() to search for files with Unicode chars in the name. As far as I can tell, TMDb looks for results in unicode and returns the response in unicode as well.

Take for example a cover art file for Amélie:

Amélie.jpg

A simple controlled experiment shows that no matter what I try, it works when using typed Unicode strings, but not when using os.listdir().

# -*- coding: utf-8 -*- 

import os
import tmdbsimple as tmdb

tmdb.API_KEY = '<your-key-here>'

print os.listdir('/media/artwork/')
print os.listdir(u'/media/artwork/')

print('\nstr controlled test')
s = tmdb.Search()
s.movie(query='Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nunicode controlled test')
s = tmdb.Search()
s.movie(query=u'Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)  

print('\nstr listdir')
for file in os.listdir('/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

print('\nunicode listdir')
for file in os.listdir(u'/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)

Outputs:

['Ame\xcc\x81lie.jpg']
[u'Ame\u0301lie.jpg']

str controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

unicode controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }

str listdir
('Results', 0)

unicode listdir
('Results', 0)

So why does the typed string work consistently, whether it is a str or a unicode literal, while the filename pulled from the filesystem does not?

I've tried:

So how can I get a folder enumeration to work with Unicode characters, or is there a proper way to convert these ascii filenames to unicode without the dreaded "ordinal not in range" error?

Upvotes: 0

Views: 1509

Answers (1)

Mark Tolonen

Reputation: 177725

The difference is that your file system uses decomposed Unicode characters. If you normalize the returned filenames to composed Unicode characters, the search works. \xe9 is the single Unicode character é, whereas e\u0301 is an ASCII e followed by a combining acute accent:

>>> import unicodedata as ud
>>> u'Am\xe9lie' == ud.normalize('NFC',u'Ame\u0301lie')
True
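
To make the composed/decomposed distinction concrete, here is a small sketch that compares the two spellings directly. The variable names are only illustrative; the only library call is the standard unicodedata.normalize.

```python
# -*- coding: utf-8 -*-
import unicodedata

composed = u'Am\xe9lie'        # é as one code point, U+00E9
decomposed = u'Ame\u0301lie'   # plain e followed by U+0301 COMBINING ACUTE ACCENT

# The strings look the same on screen but differ as code point sequences.
lengths = (len(composed), len(decomposed))
equal_raw = composed == decomposed

# NFC composes and NFD decomposes, so either form can serve as common ground.
equal_nfc = unicodedata.normalize('NFC', decomposed) == composed
equal_nfd = unicodedata.normalize('NFD', composed) == decomposed

print(lengths, equal_raw, equal_nfc, equal_nfd)
```

The decomposed string is one code point longer, which is why a plain equality check or an exact-match search can fail even though both strings display as Amélie.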

So use:

import unicodedata as ud
print('\nunicode listdir')
for filename in os.listdir(u'/media/artwork/'):
    nfilename = ud.normalize('NFC', filename)
    s = tmdb.Search()
    s.movie(query=os.path.splitext(nfilename)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)
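
If you also want the byte-string listing (os.listdir called with a str path) to work, the names have to be decoded before they can be normalized. A minimal sketch, assuming the file system stores names as UTF-8, which matches the \xcc\x81 bytes in your output; search_title is a hypothetical helper name, not part of tmdbsimple:

```python
import os
import unicodedata

def search_title(filename, encoding='utf-8'):
    """Hypothetical helper: turn a directory entry into an NFC query string."""
    # Byte strings (what os.listdir returns for a str path in Python 2)
    # must be decoded first; the UTF-8 default is an assumption.
    if isinstance(filename, bytes):
        filename = filename.decode(encoding)
    # Drop the extension, then compose e + U+0301 into the single U+00E9.
    stem = os.path.splitext(filename)[0]
    return unicodedata.normalize('NFC', stem)

print(repr(search_title(u'Ame\u0301lie.jpg')))
print(repr(search_title(b'Ame\xcc\x81lie.jpg')))
```

Both calls should give the same composed string, which can then be passed as the query argument to s.movie().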

Upvotes: 2
