Reputation: 87221
How do I convert the output of os.listdir to a list of bytes (from a list of Unicode strs)? It has to work even if the filename is invalid UTF-8, for example:
$ locale
LANG=
LANGUAGE=
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> open(b'\x80', 'w')
<_io.TextIOWrapper name=b'\x80' mode='w' encoding='UTF-8'>
>>> os.listdir('.')
['\udc80']
>>> import sys
>>> [fn.encode(sys.getfilesystemencoding()) for fn in os.listdir('.')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
>>> [... for fn in os.listdir('.')]
[b'\x80']
So what do I need to write in place of the ... above to make it work?
Please note that it's not an option to rename the file, to use Python 2.x, or to use ASCII-only filenames in this case. I'm not looking for workarounds, I'm looking for the code that goes in place of the ...
Upvotes: 1
Views: 1961
Reputation: 177600
If you just want the filenames from os.listdir in bytes, it has that option. From the docs:
path may be either of type str or of type bytes. If path is of type bytes, the filenames returned will also be of type bytes; in all other circumstances, they will be of type str.
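A minimal sketch of that behaviour follows. The helper name listdir_bytes and the placeholder file example.txt are made up for illustration; the helper runs the path through os.fsencode first so that a str path also works, and the only documented behaviour it relies on is that os.listdir returns bytes when given a bytes path.

```python
import os
import tempfile


def listdir_bytes(path):
    """Return the entries of `path` as bytes objects (hypothetical helper)."""
    # os.fsencode() returns bytes unchanged and encodes str with the
    # filesystem encoding, so os.listdir() always receives a bytes path
    # and therefore returns bytes names.
    return os.listdir(os.fsencode(path))


with tempfile.TemporaryDirectory() as tmp:
    # Create one placeholder file so the listing is not empty.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass
    names = listdir_bytes(tmp)
    print(names)
```

In the question's scenario the call would simply be os.listdir(b'.'), which avoids decoding the names to str in the first place.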
Upvotes: 3
Reputation: 1121614
Use an error handler; in this case the surrogateescape error handler looks appropriate:
Value: 'surrogateescape'
Meaning: On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)
The os.fsencode() utility function uses this error handler; it encodes to sys.getfilesystemencoding() using the surrogateescape error handler when applicable for your OS:
Encode filename to the filesystem encoding with 'surrogateescape' error handler, or 'strict' on Windows; return bytes unchanged.
In reality it'll use 'strict' only when the filesystem encoding is mbcs, a codec only available on Windows; see the os module source.
Demo:
>>> import sys
>>> ld = ['\udc80']
>>> [fn.encode(sys.getfilesystemencoding()) for fn in ld]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed
>>> [fn.encode(sys.getfilesystemencoding(), 'surrogateescape') for fn in ld]
[b'\x80']
>>> import os
>>> [os.fsencode(fn) for fn in ld]
[b'\x80']
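To show why this is lossless, here is a small sketch of the full round trip at the codec level. It hard-codes 'utf-8' rather than calling sys.getfilesystemencoding(), which is an assumption made only to keep the example independent of the local configuration; the variable names are illustrative.

```python
# Every possible byte value, including all bytes that are invalid in UTF-8.
raw = bytes(range(256))

# Decoding: undecodable bytes 0x80..0xFF become lone surrogates
# U+DC80..U+DCFF instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns each surrogate back into its byte.
restored = text.encode("utf-8", "surrogateescape")

print(restored == raw)
```

Without the error handler, the encode step raises the UnicodeEncodeError shown in the demo, because lone surrogates are not valid in strict UTF-8.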
Upvotes: 4
Reputation: 87221
>>> [os.fsencode(fn) for fn in os.listdir('.')]
[b'\x80']
There is also a corresponding os.fsdecode for conversion in the other direction.
Docs here: https://docs.python.org/3/library/os.html#os.fsencode
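As a rough illustration of the two functions used together, the sketch below converts names in both directions. The helper names and the sample file names are hypothetical, and the round trip of the invalid byte assumes a POSIX system, where the filesystem error handler is surrogateescape.

```python
import os


def names_to_str(raw_names):
    """Decode bytes filenames to str with os.fsdecode (hypothetical helper)."""
    return [os.fsdecode(name) for name in raw_names]


def names_to_bytes(names):
    """Encode str filenames back to bytes with os.fsencode (hypothetical helper)."""
    return [os.fsencode(name) for name in names]


# One name that is not valid UTF-8 and one ordinary ASCII name.
raw = [b"\x80", b"plain.txt"]
as_str = names_to_str(raw)
round_tripped = names_to_bytes(as_str)
print(round_tripped == raw)
```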
Upvotes: 3