GIZ

Reputation: 4643

What is the mechanism that the builtin function `open` uses to encode and decode filenames?

I'm a little confused about open. I'm running Windows 10, and sys.getfilesystemencoding() returns 'mbcs'. Suppose I pass a filename to open, for example:

open('Meow!.txt')

Suppose the source file is encoded in UTF-8. Does open encode the filename 'Meow!.txt' with the mbcs encoding, which maps to the default Windows ANSI code page, and then pass the request to the OS?

Upvotes: 3

Views: 107

Answers (1)

GIZ

Reputation: 4643

Here's what happens internally when using the builtin open, in Python 2.7 to be precise:

Python defines a constant that names the default encoding of filenames. This constant is called Py_FileSystemDefaultEncoding and varies per platform. When its value is NULL at compile time, Python tries to get the platform's default encoding at startup, if there is one:

    /* bltinmodule.c */

    /* The default encoding used by the platform file system APIs
       Can remain NULL for all platforms that don't have such a concept
    */
    #if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
    const char *Py_FileSystemDefaultEncoding = "mbcs";
    #elif defined(__APPLE__)
    const char *Py_FileSystemDefaultEncoding = "utf-8";
    #else
    const char *Py_FileSystemDefaultEncoding = NULL; /* use default */
    #endif

On Windows, Py_FileSystemDefaultEncoding is "mbcs" (multi-byte character set), which stands for the current Windows ANSI code page. You can check its value by calling sys.getfilesystemencoding():

Python 2.7 Documentation: sys.getfilesystemencoding()

On Windows NT+, file names are Unicode natively, so no conversion is performed. getfilesystemencoding() still returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names.
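As a quick check of the above, here is a minimal sketch written for Python 3 (the answer itself targets 2.7, so this is an assumption about the reader's interpreter). os.fsencode and os.fsdecode are the standard-library helpers that apply the filesystem encoding in Python 3; the variable names are only illustrative.

```python
import os
import sys

# Name of the encoding Python uses to convert between text and bytes filenames.
# The value is platform-dependent, so no expected output is shown here.
fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)

# os.fsencode / os.fsdecode apply that encoding explicitly.
name = "Meow!.txt"
as_bytes = os.fsencode(name)           # text -> bytes filename
round_tripped = os.fsdecode(as_bytes)  # bytes -> text filename
print(as_bytes, round_tripped)
```

An ASCII-only name such as 'Meow!.txt' round-trips unchanged, which is why the encoding question only becomes visible with non-ASCII filenames.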

So, for example, take a filename with Chinese characters. For simplicity, I'm going to use the CJK character U+5F08 (弈, "Chinese chess") in the filename I'm going to write:

>>> f = open(u'\u5F08.txt', 'w')
>>> f
<open file u'\u5f08.txt', mode 'w' at 0x000000000336B1E0>

  • Generally speaking, what happens when you pass a filename to open as unicode in 2.X or as str in 3.X?

The answer is platform-dependent. On Windows, for instance, there is no need to encode a Unicode filename at all, not even with the default filesystem encoding "mbcs". In fact, encoding it explicitly breaks the name when the character is not in the ANSI code page:

>>> f = open(u'\u5F08.txt'.encode('mbcs'), 'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('w') or filename: '?.txt'
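The '?.txt' in the traceback comes from the lossy encoding step, not from open itself. The sketch below reproduces that step on any platform by using cp1252 as a stand-in for "mbcs"; that substitution is an assumption (the "mbcs" codec only exists on Windows and maps to whatever ANSI code page is active), and errors="replace" is used to mimic the substitution seen in the traceback.

```python
name = u"\u5f08.txt"

# Strict encoding fails because U+5F08 has no cp1252 representation.
strict_error = None
try:
    name.encode("cp1252")
except UnicodeEncodeError as exc:
    strict_error = exc

# With replacement, the unencodable character collapses to '?'.
lossy = name.encode("cp1252", errors="replace")
print(lossy)  # b'?.txt'
```

Since '?' is not a legal character in Windows filenames, passing the lossy result to open fails with the invalid-filename error shown above.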

By the way, encoding with 'utf-8' doesn't give you the correct filename either:

>>> f = open(u'\u5F08.txt'.encode('utf8'), 'w')

On Windows this creates a file with a mojibake name (for example å¼ˆ.txt if the ANSI code page is 1252) instead of 弈.txt. In conclusion, there is apparently no conversion for Unicode filenames.

I think this rule applies to str too: since str in 2.X is a raw byte string, Python won't magically pick an encoding. I cannot verify this, however, and it's possible that Python decodes str with the "mbcs" encoding. I believe this could be verified by using characters outside the "mbcs" code page's character set, but that again depends on your Windows locale settings. So much is encapsulated at the lower level in the Windows implementation.

If memory serves, "mbcs" is now considered legacy for Windows APIs. Python 3.6 uses UTF-8 instead, unless the legacy mode is enabled.

Really, though, the issue seems to lie deep in the Windows APIs and their implementation rather than in Python itself.
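Going back to the UTF-8 attempt: the garbled name is what you get when UTF-8 bytes are read back as if they were in the ANSI code page. A minimal sketch, again assuming cp1252 as a stand-in for the active code page (other code pages would produce different garbage):

```python
name = "\u5f08.txt"

# U+5F08 becomes three bytes in UTF-8.
utf8_bytes = name.encode("utf-8")  # b'\xe5\xbc\x88.txt'

# Reading those bytes as cp1252 turns each byte into its own character:
# 0xE5 -> U+00E5, 0xBC -> U+00BC, 0x88 -> U+02C6.
mojibake = utf8_bytes.decode("cp1252")

# ascii() escapes non-ASCII characters so the output does not depend on the console.
print(ascii(mojibake))
print(len(name), len(mojibake))
```

The one-character prefix turns into three characters, which matches the three-character garbage seen in the filename.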

Upvotes: 1
