Python3 str(), bytes, and unicode

Question

I'm having trouble with the TYPES involved with this piece of code I wrote. Ideally I wouldn't pay any mind to encoding types, but sometimes you're forced.

So this is all centered around a directory walk of an NTFS FS on Windows. Certain characters in file names (Unicode, it seems) couldn't be written out to files or printed to the standard Windows terminal (yes, I tried "chcp 65001" to print, which didn't work, but I need to write to a standard plain text file anyway).

So I do the following. As I understand it Python3 (I'm using 3.2.2) is unicode, so str() objects (and all supporting libs) are unicode, so I did this:

absfilepath = os.path.join(root, file).encode()

thinking utf-8 string would be returned and all is good with the world, but then I was getting errors about implicit type conversions to str() when I went to file write or stdout. So I did the following:

hashmap[checksum] = str(absfilepath)

(the hashmap is dumped later).

thinking now it's in a native unicode Python3 string...but when I dump it into a file, using this:

for key, val in m.items():
    f.write(key + "|" + val + "\n")

I still get this in the file:

e77bceb64d179377731a94186e56281c|b'K:\Filename'

which is indicative of a byte array.

So what am I doing wrong here? I'm sorry 'non-traditional' characters are in this directory tree, I'd rather them not be there, but they're there. How do I store them (convert them?) into a manner that can be printed/written in normal plain text (ASCII?) and why is a byte array being returned from my hashmap where I'm clearly storing a standard string? Dealing with unicode has been a pretty horrific experience for me.

poke · Accepted Answer

absfilepath = os.path.join(root, file).encode()

os.path.join() returns a string, str.encode() converts the string to a bytes object, so absfilepath contains a bytes object.

hashmap[checksum] = str(absfilepath)

When you call str() on a bytes object, the bytes object is not decoded but instead a string representation is created:

>>> str(b'K:\\Filename')
"b'K:\\\\Filename'"
>>> str(b'K:\\Filename') == repr(b'K:\\Filename')
True

So your dictionary now contains lots of "b'some-bytes-string'" strings.
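To make the difference concrete, here is a minimal sketch contrasting str() on a bytes object with an actual decode. The variable names are only illustrative, and UTF-8 is assumed as the encoding (it is the default for str.encode() in Python 3).

```python
# A path as text, then encoded to bytes the way the question's code does it.
raw = "K:\\Filename".encode("utf-8")

# str() with one argument does NOT decode; it returns the repr of the bytes.
as_repr = str(raw)

# Either of these turns the bytes back into the original text.
decoded = raw.decode("utf-8")
also_decoded = str(raw, "utf-8")  # str() decodes only when given an encoding

print(as_repr)   # includes the b'...' wrapper and a doubled backslash
print(decoded)   # the plain path
```

So if bytes ever do end up in the dictionary, bytes.decode() is the way back to a string, not str().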

The “fix” is simple: Just don’t encode the strings you get from os.path.join.
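A rough sketch of what the walk could look like with the paths left as strings. The function name build_hashmap is hypothetical, and MD5 is only an assumption based on the 32-character checksum shown in the question; the original hashing code was not posted.

```python
import hashlib
import os


def build_hashmap(top):
    """Map a hex checksum to the joined path (a str) of each file under top."""
    hashmap = {}
    for root, dirs, files in os.walk(top):
        for name in files:
            # os.path.join() already returns str; no .encode() needed.
            absfilepath = os.path.join(root, name)
            with open(absfilepath, "rb") as fh:
                # Assumption: MD5 over the file contents.
                checksum = hashlib.md5(fh.read()).hexdigest()
            hashmap[checksum] = absfilepath
    return hashmap
```

Every value in the returned dictionary is an ordinary str, so there is nothing to convert before writing it out.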

If you get errors while writing the strings out to the file, then consider specifying an explicit encoding when opening the file in text mode:

with open('some_file', 'w', encoding='utf-8') as f:
    …

Then Python will encode the strings as UTF-8 when writing them, regardless of the platform's default encoding.
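As a small round-trip sketch: the dictionary contents, the non-ASCII file name, and the output location below are made up for illustration; only the open(..., encoding='utf-8') call comes from the answer.

```python
import os
import tempfile

# Hypothetical data: one checksum mapped to a path with non-ASCII characters.
m = {"e77bceb64d179377731a94186e56281c": "K:\\F\u00eflename \u2603"}

# Write into a fresh temporary directory so the example is self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "hashes.txt")

# Text mode with an explicit encoding: str goes in, UTF-8 bytes land on disk.
with open(out_path, "w", encoding="utf-8") as f:
    for key, val in m.items():
        f.write(key + "|" + val + "\n")

# Read it back with the same encoding to confirm nothing was lost.
with open(out_path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(len(lines))
```

Reading the file back with the same encoding returns exactly the strings that were written.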

Alternatively, to be completely safe, you can also open the file in binary mode and write the encoded strings instead:

with open('some_file', 'bw') as f:
    value = key + "|" + val + "\n"
    f.write(value.encode()) # write a bytes object

But as long as you are within Python, you don’t need to worry about special characters inside the string objects. Python can handle them; it’s just the output devices that typically fail (e.g. printing to the console).
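If the output really has to be plain ASCII (for a console or a tool that cannot handle anything else), one option is to escape the non-ASCII characters at the point of output. This is a sketch with a made-up path, not something from the original code; note that the result is lossy in the sense that existing backslashes are not escaped, so it is meant for display rather than for reading back.

```python
# Hypothetical path containing two non-ASCII characters.
path = "K:\\F\u00eflename \u2603"

# Encode to ASCII, replacing anything unencodable with a backslash escape,
# then decode so the result is an ordinary (ASCII-only) str again.
ascii_safe = path.encode("ascii", errors="backslashreplace").decode("ascii")

print(ascii_safe)
```

The string kept inside the program stays untouched; only the copy sent to the output device is escaped.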
