Reputation: 1156
I'm trying to write a program in Python that compares files and shows the duplicates. Does anyone know any good functions or methods for this? I am sorta lost...
Upvotes: 1
Views: 5154
Reputation: 298246
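If you're just looking for exact duplicates, compute a cryptographic hash of both files (SHA-512 here; MD5 would work the same way) and see if the digests match:
import hashlib
# Open in binary mode so the bytes are hashed exactly as stored on disk.
with open('file1.avi', 'rb') as f:
    file1 = f.read()
with open('file2.avi', 'rb') as f:
    file2 = f.read()
if hashlib.sha512(file1).hexdigest() == hashlib.sha512(file2).hexdigest():
    print('They are the same')
else:
    print('They are different')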
If you also need to catch videos that are not byte-for-byte identical, I'd try OpenCV's Python bindings and check whether they match up frame by frame.
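For large files such as videos, reading everything into memory at once may not be practical. Here is a minimal sketch of the same hash comparison done in chunks. The helper names `hash_file` and `same_content` and the 1 MiB chunk size are my own assumptions, not anything from a library:

```python
import hashlib


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-512 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha512()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same digest (hypothetical helper)."""
    return hash_file(path_a) == hash_file(path_b)
```

Usage would look like `same_content('file1.avi', 'file2.avi')`. Memory use stays around one chunk regardless of file size.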
Upvotes: 3
Reputation: 56654
I would use os.walk to go through the file tree.
For each file, I would store the absolute path (directory plus filename), indexed by file size and signature (first 16 bytes? Hash of the first 512 bytes? Hash of the full file?).
When finished, you end up with a dict of file sizes; for each size, a dict of file signatures; for each signature, a list of all files sharing that signature. If your file signature is not based on the full file, or has a significant chance of collisions, you can then do a more in-depth comparison of just those colliding files.
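A rough sketch of that size -> signature -> paths index, assuming the "hash of the first 512 bytes" option as the signature. The names `quick_signature`, `build_index` and `candidate_groups` are made up for illustration:

```python
import hashlib
import os
from collections import defaultdict

SIGNATURE_BYTES = 512  # assumption: only the first 512 bytes feed the signature


def quick_signature(path):
    """Cheap partial signature: SHA-256 of the first SIGNATURE_BYTES bytes."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read(SIGNATURE_BYTES)).hexdigest()


def build_index(root):
    """Return {size: {signature: [absolute paths]}} for files under root."""
    index = defaultdict(lambda: defaultdict(list))
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that is not a regular file
            index[os.path.getsize(path)][quick_signature(path)].append(path)
    return index


def candidate_groups(index):
    """Yield sorted lists of paths that share both size and signature."""
    for by_signature in index.values():
        for paths in by_signature.values():
            if len(paths) > 1:
                yield sorted(paths)
```

Because the signature here only covers the start of each file, the groups are candidates rather than confirmed duplicates; files that agree in size and first 512 bytes but differ later would still need the in-depth comparison.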
Upvotes: 1
Reputation: 15209
I would start by comparing filenames and file sizes. If you find a match, you could then loop through the bytes of both files to compare them, although this is probably pretty expensive for large files.
I do not know of a library that can do this in python.
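For the byte-by-byte step at least, the standard library's `filecmp` module may be enough. A small sketch, where the wrapper name `is_exact_duplicate` is hypothetical and the size check mirrors the "compare sizes first" idea above:

```python
import filecmp
import os


def is_exact_duplicate(path_a, path_b):
    """True if the two files have identical contents."""
    # Files of different sizes can never be identical, and this check is cheap.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # shallow=False makes filecmp compare the actual bytes instead of
    # trusting os.stat() metadata alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This leaves out the filename comparison on purpose, since two copies of the same file can easily have different names.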
Upvotes: 0