Bernd

Reputation: 3418

Convert python filenames to unicode

I am on python 2.6 for Windows.

I use os.walk to read a file tree. Files may have non-7-bit characters (German "ae" for example) in their filenames. These are encoded in Python's internal string representation.

I am processing these filenames with Python library functions and that fails due to wrong encoding.

How can I convert these filenames to proper (unicode?) python strings?

I have a file "d:\utest\ü.txt". Passing the path as unicode does not work:

>>> list(os.walk('d:\\utest'))
[('d:\\utest', [], ['\xfc.txt'])]
>>> list(os.walk(u'd:\\utest'))
[(u'd:\\utest', [], [u'\xfc.txt'])]

Upvotes: 16

Views: 28242

Answers (6)

Pegasus

Reputation: 1593

os.walk(unicode(root_dir, 'utf-8'))

Upvotes: 4

Shourya Sarcar

Reputation: 61

I was looking for a solution for Python 3.0+. I will put it up here in case someone else needs it.

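import os
import sys

rootdir = r'D:\COUNTRY\ROADS'  # a raw string literal cannot end in a backslash
fs_enc = sys.getfilesystemencoding()
for (root, dirname, filename) in os.walk(rootdir.encode(fs_enc)):
    # do your stuff here, but remember that now root is a bytes object
    # and dirname, filename are lists of bytes objects
    print(root, dirname, filename)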
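
On Python 3.2 and later, the standard library also has os.fsencode and os.fsdecode, which apply the file system encoding for you. The following is only a sketch of turning the names from a bytes-based walk back into str; the helper names to_text and walk_as_text are made up for illustration.

```python
import os


def to_text(name):
    """Return a filename as str, decoding bytes with the file system encoding."""
    # os.fsdecode decodes bytes and passes str through unchanged.
    return os.fsdecode(name)


def walk_as_text(top):
    """Walk `top` (str or bytes) and yield (root, dirnames, filenames) as str."""
    for root, dirnames, filenames in os.walk(top):
        yield (to_text(root),
               [to_text(d) for d in dirnames],
               [to_text(f) for f in filenames])
```

Because walk_as_text builds new lists, pruning directories by editing dirnames in place will not affect the walk.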

Upvotes: 6

gatoatigrado

Reputation: 16850

A more direct way might be to try the following: find your file system's encoding, and then decode the filename from that encoding to unicode. For example, assuming the encoding is UTF-8:

unicode_name = unicode(filename, "utf-8", errors="ignore")

To go the other way:

unicode_name.encode("utf-8")
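
Hard-coding "utf-8" only helps if that is really what the file system uses, which is often not the case on Windows. A small sketch that asks Python for the encoding instead, assuming sys.getfilesystemencoding() reports the right one; to_unicode is a hypothetical helper name, and "replace" is used here rather than "ignore" so that bad bytes stay visible in the result.

```python
import sys


def to_unicode(name, encoding=None):
    """Decode a byte-string filename; return text input unchanged."""
    if not isinstance(name, bytes):
        # Already a text (unicode) string, nothing to decode.
        return name
    if encoding is None:
        # Fall back to UTF-8 if Python cannot report an encoding.
        encoding = sys.getfilesystemencoding() or "utf-8"
    # "replace" substitutes U+FFFD for bytes that cannot be decoded.
    return name.decode(encoding, "replace")
```

The encoding argument can be passed explicitly when you know it, for example to_unicode(filename, "cp1252").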

Upvotes: 4

RichieHindle

Reputation: 281495

If you pass a Unicode string to os.walk(), you'll get Unicode results:

>>> list(os.walk(r'C:\example'))          # Passing an ASCII string
[('C:\\example', [], ['file.txt'])]
>>> 
>>> list(os.walk(ur'C:\example'))        # Passing a Unicode string
[(u'C:\\example', [], [u'file.txt'])]

Upvotes: 47

Lennart Regebro

Reputation: 172249

No, they are not encoded in Python's internal string representation; there is no such thing. They are encoded in the encoding of the operating system/file system. Passing in unicode works for os.walk, though.

I don't know how os.walk behaves when filenames can't be decoded, but I assume that you'll get a string back, like with os.listdir(). In that case you'll again have problems later. Also, not all of Python 2.x standard library will accept unicode parameters properly, so you may need to encode them as strings anyway. So, the problem may in fact be somewhere else, but you'll notice if that is the case. ;-)

If you need more control of the decoding you can always pass in a string, and then just decode the results with filename = filename.decode() as usual.
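
For finer control over the decoding, one option is to walk with a byte-string path and decode each name yourself. This is a sketch only: the candidate encodings are an assumption, and decode_with_fallback and walk_decoded are illustrative names, not library functions.

```python
import os


def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn and return the first success."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never fails.
    return raw.decode("latin-1")


def walk_decoded(top_bytes):
    """Walk a byte-string path, yielding (dirpath, raw_name, decoded_name)."""
    for dirpath, dirnames, filenames in os.walk(top_bytes):
        for name in filenames:
            yield dirpath, name, decode_with_fallback(name)
```

Keeping the raw name next to the decoded one means you can still open the file with exactly the bytes the operating system returned.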

Upvotes: 2

Roger Pate

Reputation:

os.walk isn't specified to always use os.listdir, nor is it documented how it handles Unicode. However, the os.listdir documentation does say:

Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.

Does simply using a Unicode argument work for you?

for dirpath, dirnames, filenames in os.walk(u"."):
  print dirpath
  for fn in filenames:
    print "   ", fn
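
The os.listdir note quoted above means a Unicode walk on Python 2 can return a mix of Unicode objects and undecodable byte strings. A minimal sketch for separating the two so the undecodable names can be handled on their own; split_by_type is a hypothetical helper name.

```python
def split_by_type(names):
    """Split names into (text names, undecoded byte-string names)."""
    decoded, raw = [], []
    for name in names:
        # On Python 2, bytes is an alias for str, the byte-string type.
        if isinstance(name, bytes):
            raw.append(name)
        else:
            decoded.append(name)
    return decoded, raw
```

Inside the loop above this could be called as decoded, raw = split_by_type(filenames).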

Upvotes: 2
