I have a little confusion about open. I'm running Windows 10, and when I call sys.getfilesystemencoding() I get 'mbcs'. So if I pass a filename to open, for example:
open('Meow!.txt')
Suppose the encoding of the source file is UTF-8. Does open encode the filename 'Meow!.txt' with the mbcs encoding, which is set to the default Windows ANSI code page, and then pass the request to the OS?
Generally speaking, what happens when you pass the filename to open as unicode in 2.X and str in 3.X?
Is it true that passing the filename as a bytes object in 3.X or a str in 2.X overrides the default automatic encoding of the filename?
Upvotes: 3
Views: 107
Here's what happens internally when using the builtin open, in 2.7 to be precise:
Python sets a constant that names the default encoding for filenames. This constant is called Py_FileSystemDefaultEncoding and varies per platform. When its value is left as NULL, Python falls back to the platform's default encoding, if there is one:
/*bltinmodule.c*/
/* The default encoding used by the platform file system APIs
Can remain NULL for all platforms that don't have such a concept
*/
#if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
const char *Py_FileSystemDefaultEncoding = "mbcs";
#elif defined(__APPLE__)
const char *Py_FileSystemDefaultEncoding = "utf-8";
#else
const char *Py_FileSystemDefaultEncoding = NULL; /* use default */
#endif
On Windows, Py_FileSystemDefaultEncoding is set to the "mbcs" (multi-byte character set) encoding. You can check its value with a sys.getfilesystemencoding() call. From the Python 2.7 documentation for sys.getfilesystemencoding():
On Windows NT+, file names are Unicode natively, so no conversion is performed. getfilesystemencoding() still returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names.
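To see this for yourself, here's a quick interactive check on 2.7 (a sketch: the 'mbcs' result assumes you're on Windows; on Linux or macOS you'd typically see 'utf-8' or another locale-dependent value instead):
>>> import sys
>>> sys.getfilesystemencoding()  # reflects Py_FileSystemDefaultEncoding
'mbcs'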
For example, let's take a filename with Chinese characters; for simplicity I'm going to use U+5F08 (the CJK character for Chinese chess) as the filename that I'm going to write:
>>> f = open(u'\u5F08.txt', 'w')
>>> f
<open file u'\u5f08.txt', mode 'w' at 0x000000000336B1E0>
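To confirm the file really got the intended Unicode name, you can list the directory; in 2.X, passing a unicode path to os.listdir makes it return unicode names too. This is just a sketch from an assumed Windows session, so your directory contents will differ:
>>> import os
>>> [name for name in os.listdir(u'.') if name.endswith(u'.txt')]
[u'\u5f08.txt']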
As for what happens when you pass the filename to open as unicode in 2.X and str in 3.X: the answer is platform-dependent. On Windows, for instance, there is no need to convert Unicode strings to any encoding, not even to the default filesystem encoding "mbcs". To prove that:
>>> f = open(u'\u5F08.txt'.encode('mbcs'), 'w')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('w') or filename: '?.txt'
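The '?.txt' in the error message is the giveaway: the "mbcs" codec silently replaces any character the current ANSI code page cannot represent with '?'. A minimal sketch, assuming a Western code page such as cp1252 that has no slot for U+5F08:
>>> u'\u5F08.txt'.encode('mbcs')  # unmappable character becomes '?'
'?.txt'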
By the way, even if you use 'utf-8' encoding, you won't get the correct filename:
>>> f = open(u'\u5F08.txt'.encode('utf8'), 'w')
This will give you a garbled (mojibake) filename on Windows instead of 弈.txt, because the UTF-8 bytes are reinterpreted in the ANSI code page. In conclusion, there's apparently no conversion for Unicode filenames. I think the same applies to str too: since str in 2.X is a raw byte string, Python won't pick an encoding magically. I cannot verify this, however, and it might be that Python decodes str with the "mbcs" encoding. I believe you could verify it by using characters outside your "mbcs" code page's character set, but this again depends on your Windows locale settings. So much is encapsulated at the lower level of the Windows implementation. If memory serves, "mbcs" is now considered legacy for the Windows APIs; Python 3.6 uses UTF-8 instead, unless legacy mode is enabled.
Really though, it seems the issue lies deep in the Windows APIs and their implementation, rather than in the implementation of Python itself.
Upvotes: 1