Reputation: 327
I want to read files with special file names in Python 2.7, but whatever I try, opening them fails. The file names are
F\xA8\xB9hrerschein
and
Gro\xDFhandel
I know the names were encoded with one of several code pages. I could work out which one and convert them, but I'd rather not.
Can't I somehow tell Python to open a file without going through all that encoding stuff? I mean opening the file by its raw name in bytes.
Upvotes: 2
Views: 3072
Reputation: 27714
Under Linux, a filename is just a sequence of bytes and can be in any character encoding. To open a file, you must pass its name in exactly the encoding it was stored with.
For example, if the filename Großhandel.txt was encoded as UTF-8, you must pass it as Gro\xc3\x9fhandel.txt.
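To make this concrete, here is a small sketch that shows the byte sequences one and the same name turns into under a few encodings. The helper name `encoded_names` and the choice of encodings are made up for illustration; the code should run unchanged on Python 2.7 and Python 3.

```python
from __future__ import print_function


def encoded_names(name, encodings=("utf-8", "latin-1", "cp437")):
    """Return {encoding: bytes} for a Unicode file name (hypothetical helper)."""
    return dict((enc, name.encode(enc)) for enc in encodings)


# u"\xdf" is the letter ß, written as an escape so the source stays pure ASCII.
for enc, raw in sorted(encoded_names(u"Gro\xdfhandel.txt").items()):
    print(enc, repr(raw))
```

Only the byte sequence that matches what is stored on disk will open the file, which is why the same visible name can work on one machine and fail on another.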
If you pass a Unicode string to open(), Python encodes the filename using the user's locale, which may or may not match the encoding the name is stored in.
Under OS X, UTF-8 encoding is enforced. Under Windows, the character encoding is abstracted by the I/O drivers. On these operating systems you should always pass a Unicode object to open(), and it will be converted appropriately.
If you're reading filenames from the filesystem, it is useful to get decoded Unicode filenames that can be passed straight to open(). You can get them by passing a Unicode string to os.listdir().
E.g.
Locale: LANG=en_GB.UTF-8
A directory with the following files, with their filenames encoded to UTF-8:
test.txt
€.txt
When running Python 2.7 using a byte string:
>>> import os
>>> os.listdir(".")
['\xe2\x82\xac.txt', 'test.txt']
Using a Unicode path:
>>> os.listdir(u".")
[u'\u20ac.txt', u'test.txt']
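The same comparison can be scripted. The following is only a sketch: the helper `list_both_ways` is a hypothetical name, the temporary directory stands in for the directory above, and the decoded result assumes a UTF-8 locale as in the example. It should run on Python 2.7 and Python 3.

```python
from __future__ import print_function

import os
import shutil
import sys
import tempfile


def list_both_ways(directory):
    """List a directory with a byte path and with a Unicode path (hypothetical helper)."""
    fs_enc = sys.getfilesystemencoding()
    if isinstance(directory, bytes):
        byte_path, text_path = directory, directory.decode(fs_enc)
    else:
        byte_path, text_path = directory.encode(fs_enc), directory
    # os.listdir() mirrors the type of its argument: bytes in, raw bytes out;
    # Unicode in, decoded names out (when the names can be decoded).
    return os.listdir(byte_path), os.listdir(text_path)


demo_dir = tempfile.mkdtemp()
try:
    byte_dir = list_both_ways.__globals__["sys"] and demo_dir  # keep demo_dir as returned
    if not isinstance(byte_dir, bytes):
        byte_dir = byte_dir.encode(sys.getfilesystemencoding())
    # Create test.txt and a file whose name is the UTF-8 bytes of the euro sign.
    for raw_name in (b"test.txt", b"\xe2\x82\xac.txt"):
        open(os.path.join(byte_dir, raw_name), "wb").close()
    raw_names, text_names = list_both_ways(demo_dir)
    print(sorted(raw_names))
    print(sorted(text_names))
finally:
    shutil.rmtree(demo_dir)
```

The byte listing is what you want when the encoding is unknown, because those names can be handed back to open() unchanged.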
Upvotes: 0
Reputation: 327
In the end, I fixed it with
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
and setting the environment variable
LANG="C.UTF-8"
Thanks for the hints.
Upvotes: 1
Reputation: 43497
If you have source code like
with open('Großhandel') as input:
    # stuff
you should look at source code encodings (PEP 263) and write
#!python2
# -*- coding: utf-8 -*-
with open('Großhandel') as input:
…
It is worth mentioning that the authors of PEP 263 are Marc-André Lemburg and Martin von Löwis, which I suppose makes the push toward a defined source encoding back in 2002 slightly more understandable.
Upvotes: 0
Reputation: 1970
One way is to use os.listdir(). See the following example.
Add some data to a file with the non-ASCII character 0xdf in its name:
$ echo abcd > `printf "A\xdfA"`
Check that the file name contains a non-ASCII character:
$ ls A*
A?A
Start Python, read the directory and open the first file (which is the one with the non-ASCII character):
$ python
>>> import os
>>> d = os.listdir('.')
>>> d
['A\xdfA']
>>> f = open(d[0])
>>> f.readline()
'abcd\n'
>>>
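The session above can also be written as a standalone script. This is a sketch under a few assumptions: it works in a temporary directory rather than the current one, the helper `read_first_line` is a hypothetical name, and the filesystem must accept arbitrary bytes in names (typical on Linux; OS X may reject a name that is not valid UTF-8). It should run on Python 2.7 and Python 3.

```python
from __future__ import print_function

import os
import shutil
import sys
import tempfile

RAW_NAME = b"A\xdfA"  # same bytes as in the shell example; not valid UTF-8


def read_first_line(byte_dir, byte_name):
    """Open a file by its raw byte name and return its first line as bytes."""
    with open(os.path.join(byte_dir, byte_name), "rb") as handle:
        return handle.readline()


work_dir = tempfile.mkdtemp()
# Make sure the directory path is a byte string so listdir() returns raw bytes.
if isinstance(work_dir, bytes):
    byte_dir = work_dir
else:
    byte_dir = work_dir.encode(sys.getfilesystemencoding())

try:
    with open(os.path.join(byte_dir, RAW_NAME), "wb") as handle:
        handle.write(b"abcd\n")
    names = os.listdir(byte_dir)  # byte path in, undecoded byte names out
    first_line = read_first_line(byte_dir, names[0])
    print(repr(names), repr(first_line))
finally:
    shutil.rmtree(work_dir)
```

No encoding is guessed anywhere: the name goes from listdir() to open() as the same bytes that are stored on disk.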
Upvotes: 0