fr00tyl00p

Reputation: 327

File name encoding in Python 2.7

I want to read files with special file names in Python (2.7). But whatever I try, it always fails to open them. The filenames are

F\xA8\xB9hrerschein

and

Gro\xDFhandel

I know the names were encoded with one of several codepages. I could try to find out which one and convert them, with all the mumbo jumbo that involves, but I don't want that.

Can't I somehow tell Python to open that file without having to go through all that encoding stuff? I mean, opening the file by its raw name in bytes?

Upvotes: 2

Views: 3072

Answers (4)

Alastair McCormack

Reputation: 27714

Under Linux, filenames are stored as plain byte strings and can be in any character encoding. To open a file, you must pass a name that encodes to exactly the bytes stored on disk.

For example, if the filename is Großhandel.txt encoded using UTF-8, the name you pass must be the bytes Gro\xc3\x9fhandel.txt.

If you pass a Unicode string to open(), the filename is encoded using the user's locale, which may or may not produce the bytes that are actually on disk.
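
To make this concrete, here is a small sketch of how one Unicode name turns into different byte strings depending on the encoding. Latin-1 is an assumption picked for illustration only: it happens to produce the Gro\xDFhandel bytes from the question, but the codepage actually used there is not confirmed. The snippet is written so that it should run on Python 2.7 as well as Python 3.

```python
from __future__ import print_function
import sys

name = u'Gro\xdfhandel.txt'  # u'\xdf' is the German sharp s

# The same name, encoded two different ways.
utf8_bytes = name.encode('utf-8')
latin1_bytes = name.encode('latin-1')  # assumed codepage, for illustration only

print(repr(utf8_bytes))
print(repr(latin1_bytes))

# The encoding Python uses when it has to turn a Unicode path into bytes.
print(sys.getfilesystemencoding())
```

If the bytes on disk were written with a different encoding than the one printed last, a Unicode name passed to open() will not match the file.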

Under OS X, UTF-8 encoding is enforced. Under Windows, the character encoding is abstracted by the I/O drivers. On these operating systems you should always pass a Unicode object to open(), and it will be converted appropriately.

If you're reading filenames from the filesystem, it is useful to get decoded Unicode filenames that you can pass straight to open(). You can get them by passing a Unicode path to os.listdir().

E.g.

Locale: LANG=en_GB.UTF-8

A directory with the following files, with their filenames encoded to UTF-8:

test.txt
€.txt

When running Python 2.7 with a byte string path:

>>> import os
>>> os.listdir(".")
['\xe2\x82\xac.txt', 'test.txt']

Using a Unicode path:

>>> os.listdir(u".")
[u'\u20ac.txt', u'test.txt']
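
If you would rather keep the raw byte names for opening and only decode them for display, something along these lines might work. This is a sketch, not part of the original answer: display_name is a hypothetical helper, and UTF-8 as the display encoding is an assumption. Bytes that do not decode are replaced with U+FFFD rather than raising an error.

```python
def display_name(raw_name, encoding='utf-8'):
    """Return a printable Unicode version of a byte-string filename.

    Hypothetical helper: undecodable bytes become U+FFFD instead of
    raising UnicodeDecodeError. The raw name itself is left untouched.
    """
    return raw_name.decode(encoding, 'replace')


# Byte names as os.listdir() would return them for a byte string path.
raw_names = [b'\xe2\x82\xac.txt', b'test.txt', b'Gro\xdfhandel']

# Keep the raw name for open(), use the decoded one only for display.
pairs = [(raw, display_name(raw)) for raw in raw_names]
```

The first element of each pair is what you would pass to open(); the second is only for showing the name to a user.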

Upvotes: 0

fr00tyl00p

Reputation: 327

In the end, I fixed it with

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

and setting the environment variable

LANG="C.UTF-8"

Thanks for the hints.

Upvotes: 1

msw

Reputation: 43497

If you have source code like

with open('Großhandel') as input:
    pass  # stuff

You should look at Source Code Encodings (PEP 263) and write

#!python2
# -*- coding: utf-8 -*-
with open('Großhandel') as input:
    …

It is worth mentioning that the authors of PEP 263 are Marc-André Lemburg and Martin von Löwis, which I suppose makes the push toward a defined source encoding back in 2002 slightly more understandable.

Upvotes: 0

NZD

Reputation: 1970

One way is to use os.listdir(). See the following example.

Add some data to a file with the non-ASCII byte 0xdf in its name:

$ echo abcd > `printf "A\xdfA"`

Check that the file name contains a non-ASCII character:

$ ls A*
A?A

Start Python, read the directory and open the first file (which is the one with the non-ASCII character):

$ python
>>> import os
>>> d = os.listdir('.')
>>> d
['A\xdfA']
>>> f = open(d[0])
>>> f.readline()
'abcd\n'
>>> 
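
The same round trip can be sketched as a standalone script. It assumes a Linux-style filesystem that accepts arbitrary bytes in names; the temporary directory and the name Gro\xdfhandel are chosen for illustration. It should run on Python 2.7 and Python 3, and at no point does it decode the file name.

```python
import os
import shutil
import sys
import tempfile

tmp_root = tempfile.mkdtemp()

# Work with a byte string path throughout (on Python 2 it already is one).
tmp_dir = tmp_root
if not isinstance(tmp_dir, bytes):
    tmp_dir = tmp_dir.encode(sys.getfilesystemencoding())

raw_name = b'Gro\xdfhandel'  # not valid UTF-8; used as-is

try:
    # Create the file by its raw byte name.
    with open(os.path.join(tmp_dir, raw_name), 'wb') as f:
        f.write(b'abcd\n')

    # A byte string path makes os.listdir() return raw byte names.
    names = os.listdir(tmp_dir)

    # Open the file again using exactly the bytes that listdir() returned.
    with open(os.path.join(tmp_dir, names[0]), 'rb') as f:
        content = f.read()
finally:
    shutil.rmtree(tmp_root)
```

Because the name is never converted to Unicode, the locale and default encoding do not come into play.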

Upvotes: 0
