Reputation: 647
I'm having trouble with the TYPES involved with this piece of code I wrote. Ideally I wouldn't pay any mind to encoding types, but sometimes you're forced.
So this is all centered around a directory walk of an NTFS file system on Windows. Certain characters in file names (Unicode, it seems) couldn't be written out to files or printed to the standard Windows terminal. (Yes, I tried "chcp 65001" to print, which didn't work, but I need to write to a standard plain text file anyway.)
So I do the following. As I understand it, Python 3 (I'm using 3.2.2) is Unicode throughout, so str objects (and all supporting libs) are Unicode, so I did this:
absfilepath = os.path.join(root, file).encode()
thinking a UTF-8 string would be returned and all would be good with the world, but then I was getting errors about implicit type conversions to str() when I went to write to a file or to stdout. So I did the following:
hashmap[checksum] = str(absfilepath)
(the hashmap is dumped later).
thinking now it's in a native unicode Python3 string...but when I dump it into a file, using this:
for key, val in m.items():
    f.write(key + "|" + val + "\n")
I still get this in the file:
e77bceb64d179377731a94186e56281c|b'K:\Filename'
which indicates a byte array.
So what am I doing wrong here? I'm sorry 'non-traditional' characters are in this directory tree; I'd rather they weren't there, but they are. How do I store them (convert them?) in a form that can be printed or written as normal plain text (ASCII?), and why is a byte array coming back from my hashmap when I'm clearly storing a standard string? Dealing with Unicode has been a pretty horrific experience for me.
Upvotes: 1
Views: 4992
Reputation: 387507
absfilepath = os.path.join(root, file).encode()
os.path.join() returns a string, and str.encode() converts the string to a bytes object, so absfilepath contains a bytes object.
hashmap[checksum] = str(absfilepath)
When you call str() on a bytes object without an encoding argument, the bytes object is not decoded; instead, a string representation of it is created:
>>> str(b'K:\Filename')
"b'K:\\\\Filename'"
>>> str(b'K:\Filename') == repr(b'K:\Filename')
True
So your dictionary now contains lots of "b'some-bytes-string'" strings.
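To make the difference concrete, here is a minimal sketch comparing the three conversions. The path is the one from the question; the variable names are only illustrative.

```python
# The path from the question, encoded the way .encode() does it (UTF-8 by default).
raw = "K:\\Filename".encode("utf-8")

as_repr = str(raw)               # no encoding given: you get the repr, not a decode
as_text = raw.decode("utf-8")    # an actual decode back to str
also_text = str(raw, "utf-8")    # str() decodes only when you pass an encoding

print(as_repr)   # starts with b' because it is the repr of the bytes object
print(as_text)   # the plain path
```

So str(some_bytes) is only useful for debugging output; to get the text back you need decode() or the two-argument form of str().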
The “fix” is simple: just don’t encode the strings you get from os.path.join.
If you get errors while writing the strings out to the file, then consider specifying an explicit encoding when opening the file in text mode:
with open('some_file', 'w', encoding='utf-8') as f:
    …
Python will then encode each string as UTF-8 when writing, so non-ASCII characters in the paths are stored correctly.
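As a rough sketch of what that looks like for the dump loop in the question: the checksum is the one from the question, while the file name with non-ASCII characters and the temporary output path are made up for illustration.

```python
import os
import tempfile

# Paths stay as str; nothing is encoded by hand.
# The file name below is a hypothetical example with non-ASCII characters.
hashmap = {"e77bceb64d179377731a94186e56281c": "K:\\Fil\u00e9name \u2603.txt"}

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.mkdtemp(), "hashes.txt")

# The encoding argument tells Python how to turn str into bytes on disk.
with open(out_path, "w", encoding="utf-8") as f:
    for key, val in hashmap.items():
        f.write(key + "|" + val + "\n")

# Reading back with the same encoding returns the original text.
with open(out_path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Without the encoding argument, Python falls back to a platform-dependent default, which on Windows is often a legacy code page that cannot represent every character.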
Alternatively, to be completely safe, you can also open the file in binary mode and write the encoded strings instead:
with open('some_file', 'wb') as f:
    value = key + "|" + val + "\n"
    f.write(value.encode())  # write a bytes object (UTF-8 by default)
But as long as you are within Python, you don’t need to worry about special characters inside the string objects. Python can handle them; it’s just the output devices that typically fail (e.g. printing to the console).
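If the output really has to be plain ASCII (as the question hints), or you just want something the console can always print, one option is to escape the non-ASCII characters for display. This is only a sketch: ascii_safe is a hypothetical helper name, and the file name is made up.

```python
def ascii_safe(text):
    """Return text with every non-ASCII character replaced by a backslash escape."""
    # backslashreplace turns e.g. U+00E9 into \xe9 and U+2603 into \u2603
    return text.encode("ascii", "backslashreplace").decode("ascii")

path = "K:\\Fil\u00e9name \u2603.txt"   # hypothetical file name
printable = ascii_safe(path)
print(printable)
```

Note that this is meant for display only: literal backslashes in the path are not escaped, so the result cannot be reliably converted back into the original name. Keep the original str around for actual file operations.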
Upvotes: 3
Reputation: 1121216
You encoded your unicode string:
absfilepath = os.path.join(root, file).encode()
#                                      ^^^^^^^^
This produces a bytestring. Either don't encode, or decode again when storing the paths in your hashmap:
hashmap[checksum] = absfilepath.decode()
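A quick way to convince yourself that the round trip is lossless, assuming both calls use the same encoding (UTF-8 is the default for both in Python 3). The root and file values here are hypothetical stand-ins for what os.walk would yield.

```python
import os

# Hypothetical values standing in for what os.walk() yields.
root, file = "K:\\data", "Fil\u00e9name.txt"

original = os.path.join(root, file)   # str
absfilepath = original.encode()       # bytes, UTF-8 by default
restored = absfilepath.decode()       # back to str, UTF-8 by default
```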
Upvotes: 2