Why am I getting UnicodeDecodeError here?

Question

I am trying to write a Python script to find duplicate files on a USB flash drive.

The process I am following is to create a list of the file names, hash each file, and then build an inverse dictionary. However, somewhere in the process I am getting a UnicodeDecodeError. Could someone help me understand what's going on?

from os import listdir
from os.path import isfile, join
from collections import defaultdict
import hashlib

my_path = r"F:/"

files_in_dir = [ file for file in listdir(my_path) if isfile(join(my_path, file)) ]
file_hashes = dict()

for file in files_in_dir:
    file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()

inverse_dict = defaultdict(list)

for file, file_hash in file_hashes.iteritems():
    inverse_dict[file_hash].append(file)

inverse_dict.items()

The error that I face is:

Traceback (most recent call last):
  File "C:\Users\Fotis\Desktop\check_dup.py", line 12, in <module>
    file_hashes[file] = hashlib.md5(open(join(my_path, file), 'r').read()).digest()
  File "C:\Python33\lib\encodings\cp1253.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 2227: character maps to <undefined>

Martijn Pieters · Accepted Answer

You are trying to read a file whose contents are not valid in the default platform encoding (cp1253). When you open a file in text mode (r), Python 3 tries to decode the file contents to Unicode. You didn't specify an encoding, so the platform's preferred encoding is used.
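
The failure can be reproduced without touching the drive. The snippet below is a minimal sketch: `raw` is a made-up stand-in for the bytes of one of your files, and the only thing taken from the traceback is that byte 0xff has no mapping in cp1253.

```python
import locale

# The encoding that open() falls back to in text mode when none is given.
# On the machine from the traceback this would be cp1253.
print(locale.getpreferredencoding(False))

# Hypothetical file contents: three ASCII bytes followed by 0xff.
raw = b"abc\xff"

try:
    raw.decode("cp1253")  # roughly what text mode does behind the scenes
    decoded = True
except UnicodeDecodeError as exc:
    decoded = False
    print(exc)
```

Any binary file (an image, an archive, an executable) is likely to contain such bytes sooner or later, which is why the error shows up partway through a file rather than at the start.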

Open the files in binary mode instead, using rb as the mode. Since you are only calculating the MD5 hash (a function that expects bytes), you should not be using text mode anyway.
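
As a rough sketch of what that change could look like for the script in the question: the helper names `file_md5` and `find_duplicates` are made up for illustration, and reading in chunks and using `hexdigest()` are my own choices rather than part of the fix. The sketch also uses `.items()`, because Python 3 dictionaries have no `iteritems()` method.

```python
import hashlib
from collections import defaultdict
from os import listdir
from os.path import isfile, join


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it as raw bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:  # binary mode: no decoding takes place
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory):
    """Map each digest to the file names that share it (groups of 2 or more)."""
    by_hash = defaultdict(list)
    for name in listdir(directory):
        full_path = join(directory, name)
        if isfile(full_path):
            by_hash[file_md5(full_path)].append(name)
    # Keep only digests that were seen for more than one file.
    return {h: names for h, names in by_hash.items() if len(names) > 1}
```

Calling `find_duplicates(r"F:/")` would then return a dictionary whose values are the groups of file names with identical contents. Reading in chunks keeps memory use flat for large files; a plain `open(path, "rb").read()` works just as well for small ones.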
