Reputation: 2531
Say I had the following code:
import os
homeDir = os.path.expanduser("~")
fullPath = homeDir + "/.config"
print fullPath
Would this code still function properly for someone in, say, Japan, whose home directory name was composed of kanji?
My concern is that Python won't know how to join the two languages together, or even know what to do with the foreign characters.
Upvotes: 4
Views: 140
Reputation: 414315
On Python 2, all strings in the code from your question are bytestrings (sequences of bytes). They can represent anything, including text encoded in some character encoding.
import os
homeDir = os.path.expanduser("~")  # input bytestring, returns bytestring
fullPath = homeDir + "/.config"  # add 2 bytestrings
print fullPath
The print statement works, but you may see garbage in the console if the console uses a different character encoding than the path. Otherwise the code works for any language, including names with non-ASCII characters.
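To illustrate why concatenating encoded strings is safe, here is a minimal Python 3 sketch in which bytes plays the role of the Python 2 bytestring. The kanji home directory and the choice of UTF-8 are made-up assumptions for the example, not something from the question:

```python
# Hypothetical home directory, encoded the way the OS might store it (UTF-8 assumed).
home_dir = "/home/太郎".encode("utf-8")

# Concatenation only appends bytes; nothing is decoded, so nothing can fail here.
full_path = home_dir + b"/.config"

# Decoding with the same encoding recovers the text intact.
decoded_utf8 = full_path.decode("utf-8")

# Decoding with a different encoding is where the "garbage" comes from.
decoded_latin1 = full_path.decode("latin-1")
```

The bytes themselves never change; only the interpretation chosen when displaying them differs.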
On Python 3, or if from __future__ import unicode_literals is used, string literals are Unicode. In this case it should also work:
from __future__ import unicode_literals
import os
homeDir = os.path.expanduser("~")  # input Unicode, returns Unicode
fullPath = homeDir + "/.config"  # add 2 Unicode strings
print(fullPath) # print Unicode
The print may fail (try setting an appropriate PYTHONIOENCODING in this case).
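A rough Python 3 sketch of that failure mode, using an in-memory stream as a stand-in for the console. The path is hypothetical, and the ASCII-only stream is just an assumed example of a console encoding that cannot represent the characters:

```python
import io

text = "/home/太郎/.config"  # hypothetical path

# A stream that can only encode ASCII, standing in for a misconfigured console.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(text, file=ascii_stream)
    ascii_stream.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

# The same kind of stream with UTF-8, similar in effect to PYTHONIOENCODING=utf-8.
buffer = io.BytesIO()
utf8_stream = io.TextIOWrapper(buffer, encoding="utf-8")
print(text, file=utf8_stream)
utf8_stream.flush()
written = buffer.getvalue()
```

For the real console, the variable is set in the environment before the interpreter starts, for example PYTHONIOENCODING=utf-8 python script.py, where script.py is a placeholder name.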
On POSIX systems, paths may contain arbitrary byte sequences (except the zero byte), including those that can't be decoded using the file system encoding. From the Python 3 docs:
In Python, file names, command line arguments, and environment variables are represented using the string type. On some systems, decoding these strings to and from bytes is necessary before passing them to the operating system. Python uses the file system encoding to perform this conversion (see sys.getfilesystemencoding()).
Changed in version 3.1: On some systems, conversion using the file system encoding may fail. In this case, Python uses the surrogateescape encoding error handler, which means that undecodable bytes are replaced by a Unicode character U+DCxx on decoding, and these are again translated to the original byte on encoding.
It means that fullPath might contain U+DCxx surrogates if the original path contains undecodable bytes, and print(fullPath) may fail even if the terminal uses a compatible character encoding. os.fsencode(fullPath) can return the original bytes if you need them.
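A small Python 3 sketch of the surrogateescape round trip. It calls the codec directly with UTF-8 instead of os.fsdecode()/os.fsencode() so the result does not depend on the local file system encoding; the path and the stray 0xff byte are made up for illustration:

```python
# Hypothetical path containing a byte (0xff) that is not valid UTF-8.
raw_path = b"/home/\xff/.config"

# Decode the way Python 3 decodes file names when the file system encoding is UTF-8.
text_path = raw_path.decode("utf-8", "surrogateescape")

# The undecodable byte 0xff became the lone surrogate U+DCFF.
has_surrogate = "\udcff" in text_path

# A strict encode, which is what printing typically requires, rejects the surrogate.
try:
    text_path.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with surrogateescape restores the original bytes exactly.
round_trip = text_path.encode("utf-8", "surrogateescape")
```

os.fsencode() performs the last step for you, using the file system encoding and the error handler Python selected at startup.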
Upvotes: 3
Reputation: 498
I would recommend reading this presentation on Unicode and encodings in Python to understand what might happen and how to tackle it.
Upvotes: 2