Reputation: 359
I'm running Python 3.5.1 on Windows. I am attempting to find duplicate source code files in a directory by computing their hashes. The problem is that Python seems to think some files are empty. Here is the relevant code snippet:
with open(path, 'rb') as afile:
    hasher = hashlib.md5()
    data = afile.read()
    hasher.update(data)
    print("len(data): {}, Path: {}, Hash:{}".format(len(data), path, hasher.hexdigest()))
Here is some example output:
len(data): 0, Path: h:\t\TCPServerSocket.h, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 0, Path: h:\t\TCPSocket.cpp, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 0, Path: h:\t\TCPSocket.h, Hash:d41d8cd98f00b204e9800998ecf8427e
len(data): 5073, Path: h:\t\ConfigFile.cpp, Hash:6188d6a0e0bc02edf27ce232689beff6
I assure you that these files are not empty, and Python is not throwing any errors during execution. Any ideas?
Upvotes: 2
Views: 752
Reputation: 5871
I'll delete this answer if it turns out not to be the case, but it's something you need to check. Put this directly before the with open(...) block (it needs import os):
print("the path is {!r}".format(path))
print("path exists: ", os.path.exists(path))
print("it is a file: ", os.path.isfile(path))
print("file size is: ", os.path.getsize(path))
Everything in your output is consistent with those files actually being empty (d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of zero bytes), so maybe they are? My first thought was that you might be zeroing out the files elsewhere, although you would figure that out pretty quickly.
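To make that check repeatable, here is a minimal sketch that compares what the OS reports for a file with what read() actually returns. The helper name diagnose and the dict keys are my own invention for illustration, not anything from the question's code.

```python
import hashlib
import os

# MD5 of zero bytes; this is the digest shown for every "empty" file in the question.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def diagnose(path):
    """Compare the size the OS reports with the number of bytes read() returns.

    Hypothetical helper: the name and the returned keys are illustrative only.
    """
    info = {
        "path": path,
        "exists": os.path.exists(path),
        "is_file": os.path.isfile(path),
        "size_on_disk": None,
        "bytes_read": None,
        "md5": None,
    }
    if info["is_file"]:
        info["size_on_disk"] = os.path.getsize(path)
        with open(path, "rb") as afile:
            data = afile.read()
        info["bytes_read"] = len(data)
        info["md5"] = hashlib.md5(data).hexdigest()
    return info
```

If size_on_disk is 0 for the problem files, they really are empty at the moment you read them, and the thing to look for is whatever truncated them. If size_on_disk is nonzero but bytes_read is 0, the problem is somewhere in how the file is being read.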
Upvotes: 2
Reputation: 2424
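I think you should compute the hash by passing the file's contents straight to hashlib.md5. Note that hashlib.md5 takes bytes, not a filename, so the file still has to be opened and read:
import hashlib
with open("filename", "rb") as f:
    print(hashlib.md5(f.read()).hexdigest())
Let me know if that continues to suggest the files are empty.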
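For the broader goal in the question (finding duplicate files in a directory), a rough sketch could look like the following. The function names md5_of_file and find_duplicates and the 64 KiB chunk size are assumptions of mine, not something from the question; reading in chunks just avoids loading large files into memory at once.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Hash a file's contents in fixed-size chunks (chunk size is arbitrary)."""
    hasher = hashlib.md5()
    with open(path, "rb") as afile:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: afile.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(directory):
    """Map each MD5 digest to the sorted paths sharing it, keeping only groups of 2+."""
    by_hash = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            by_hash[md5_of_file(path)].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Keep in mind that files which really are empty all hash to d41d8cd98f00b204e9800998ecf8427e, so they will be reported as duplicates of one another.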
Upvotes: -1