Reputation: 72905
I'm using TMDb to look up media based on filename. Most of the time this works fine, except when I use os.listdir() to search for files with Unicode chars in the name. As far as I can tell, TMDb looks for results in unicode and returns the response in unicode as well.
Take for example a cover art file for Amélie:
Amélie.jpg
A simple controlled experiment shows that no matter what I try, it works when using typed Unicode strings, but not when using os.listdir().
# -*- coding: utf-8 -*-
import os
import tmdbsimple as tmdb
tmdb.API_KEY = '<your-key-here>'
print os.listdir('/media/artwork/')
print os.listdir(u'/media/artwork/')
print('\nstr controlled test')
s = tmdb.Search()
s.movie(query='Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)
print('\nunicode controlled test')
s = tmdb.Search()
s.movie(query=u'Amélie')
print('Results', len(s.results))
for r in s.results:
    print(r)
print('\nstr listdir')
for file in os.listdir('/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)
print('\nunicode listdir')
for file in os.listdir(u'/media/artwork/'):
    s = tmdb.Search()
    s.movie(query=os.path.splitext(file)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)
Outputs:
['Ame\xcc\x81lie.jpg']
[u'Ame\u0301lie.jpg']
str controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }
unicode controlled test
('Results', 8)
{u'poster_path': u'/pM20xF4WFyX7G3ie0YBXFp75aEC.jpg', u'title': u'Am\xe9lie' ... }
str listdir
('Results', 0)
unicode listdir
('Results', 0)
So why does the typed string literal work consistently, whether str or unicode, while the filename pulled from the filesystem does not?
I've tried:
- sys with utf-8 encoding
- isinstance(file, str) (surprise, it's not unicode!)
So... how can I get a folder enumeration to work with unicode chars, OR is there a proper way I can convert these ascii filenames to unicode without the dreaded ordinal out of range error?
Upvotes: 0
Views: 1509
Reputation: 177725
The difference is that your file system uses decomposed Unicode characters. If you normalize the returned filenames to composed Unicode characters, it will work. \xe9 is the Unicode character é, and e\u0301 is an ASCII e followed by a combining accent:
>>> import unicodedata as ud
>>> u'Am\xe9lie' == ud.normalize('NFC',u'Ame\u0301lie')
True
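To make the composed/decomposed distinction a bit more concrete, here is a minimal sketch that compares the two spellings directly. The variable names are only illustrative, and the snippet uses nothing beyond the standard unicodedata module:

```python
# -*- coding: utf-8 -*-
import unicodedata

composed = u'Am\xe9lie'        # é stored as one code point (U+00E9)
decomposed = u'Ame\u0301lie'   # e followed by U+0301 COMBINING ACUTE ACCENT

# Both render the same, but they are different code point sequences,
# so a plain comparison (or an exact-match search) treats them as different.
different_before = composed != decomposed
lengths = (len(composed), len(decomposed))

# NFC composes and NFD decomposes; normalizing either way makes them comparable.
equal_after_nfc = unicodedata.normalize('NFC', decomposed) == composed
equal_after_nfd = unicodedata.normalize('NFD', composed) == decomposed
```

The typed literal in the question is most likely saved in composed form by the editor, which would explain why it matches while the name read from disk does not.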
So use:
import unicodedata as ud
print('\nunicode listdir')
for filename in os.listdir(u'/media/artwork/'):
    # Compose e + combining accent into a single é before querying
    nfilename = ud.normalize('NFC', filename)
    s = tmdb.Search()
    s.movie(query=os.path.splitext(nfilename)[0])
    print('Results', len(s.results))
    for r in s.results:
        print(r)
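If you also need to handle the byte-string names that os.listdir() returns for a str path, something along these lines might work. This is only a sketch: to_nfc_title is a hypothetical helper name, and it assumes the names on disk are encoded in the file system encoding (falling back to UTF-8 when that cannot be determined):

```python
import os
import sys
import unicodedata


def to_nfc_title(filename, encoding=None):
    """Hypothetical helper: return the NFC-normalized stem of a file name."""
    if isinstance(filename, bytes):
        # Byte strings must be decoded before they can be normalized.
        # Assumption: the bytes use the file system encoding, or UTF-8
        # if the interpreter cannot report one.
        codec = encoding or sys.getfilesystemencoding() or 'utf-8'
        filename = filename.decode(codec)
    stem = os.path.splitext(filename)[0]   # drop the extension
    return unicodedata.normalize('NFC', stem)
```

The result can then be passed as the query, e.g. s.movie(query=to_nfc_title(filename)), whichever kind of path was given to os.listdir().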
Upvotes: 2