Reputation: 9803
I am trying to write a Python script to find duplicate files on a USB flash drive.
The process I am following is to build a list of the file names, hash each file, and then create an inverse dictionary that maps each hash to the files that produced it. However, somewhere in the process I am getting a UnicodeDecodeError. Could someone help me understand what's going on?
from os import listdir
from os.path import isfile, join
from collections import defaultdict
import hashlib
my_path = r"F:/"
files_in_dir = [ file for file in listdir(my_path) if isfile(join(my_path, file)) ]
file_hashes = dict()
for file in files_in_dir:
    file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()
inverse_dict = defaultdict(list)
for file, file_hash in file_hashes.iteritems():
    inverse_dict[file_hash].append(file)
inverse_dict.items()
The error that I face is:
Traceback (most recent call last):
  File "C:\Users\Fotis\Desktop\check_dup.py", line 12, in <module>
    file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()
  File "C:\Python33\lib\encodings\cp1253.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 2227: character maps to <undefined>
Upvotes: 2
Views: 695
Reputation: 1125208
You are trying to read a file that is not encoded in the default platform encoding (cp1253). When you open a file in text mode ('r'), Python 3 tries to decode the file contents to Unicode. You didn't specify an encoding, so the platform's preferred encoding is used.
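To see the failure in isolation, here is a minimal sketch that decodes a few bytes by hand instead of reading a file. The sample bytes are made up for illustration; the only detail taken from the traceback is the byte 0xff, which cp1253 does not map to any character.

```python
import hashlib

# Hypothetical sample data: a few ASCII bytes followed by 0xff,
# the byte reported in the traceback above.
raw = b"abc\xff"

# Text mode effectively does this decode step for you, using the
# platform's preferred encoding (cp1253 on the asker's machine).
try:
    raw.decode("cp1253")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Hashing works on the raw bytes directly, so no decoding is needed.
digest = hashlib.md5(raw).hexdigest()
```

Here `decoded_ok` ends up False, while the hash is computed without any trouble because the bytes are never interpreted as text.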
Open the files in binary mode instead, using 'rb' as the mode. Since you are only calculating the MD5 hash (a function that expects bytes), you should not be using text mode anyway.
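As a rough sketch of how the script could look with binary mode, assuming Python 3: the helper names `file_md5` and `find_duplicates` and the chunk size are illustrative choices, not part of the original script. Reading in chunks is an optional extra that avoids loading a large file into memory at once.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in binary chunks."""
    md5 = hashlib.md5()
    # 'rb' yields bytes, so no decoding is attempted.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_duplicates(directory):
    """Group the regular files directly inside `directory` by content hash.

    Returns a list of name lists, one per set of identical files.
    """
    by_hash = defaultdict(list)
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path):
            by_hash[file_md5(full_path)].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]
```

Calling `find_duplicates(r"F:/")` would then return the groups of duplicate file names. Note also that `dict.iteritems()` exists only in Python 2; on Python 3 the second loop in the question needs `file_hashes.items()` instead.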
Upvotes: 5