Reputation: 336
Im not sure if im doing this right. I have created multiply "copys" of multiply files, all of them should be different in some way (image augmentation). Now, because maybe the odds are against me i want to check if any of the created files are equal to any other of those created files. Either the odds are with me or i messed up the code badly. Because there are alot of files i can't check them manually. Maybe there would be a faster way than 2 for loops.
I have the following Code.
import sys
import os
import glob
import numpy
import time
import datetime
start_time = time.time()
print(datetime.datetime.now().time())
img_dir = sys.argv[1]
data_path = os.path.join(img_dir,'*g')
files = glob.glob(data_path)
something_went_wrong = False
for f1 in files:
for f2 in files:
if f1 != f2:
if open(f1,"rb").read() == open(f2,"rb").read():
something_went_wrong = True
print(f1)
print(f2)
print("---")
print(something_went_wrong)
print("--- %s seconds ---" % (time.time() - start_time))
Upvotes: 1
Views: 86
Reputation: 18628
As said in comments, grouping by size save time:
import os
from collections import defaultdict
def fin_dup(dir):
files=defaultdict(set)
res=[]
for fn in os.listdir(dir):
if os.path.isfile(fn):
files[os.stat(fn).st_size].add(fn) # groups files by size
for size,s in sorted(files.items(),key=lambda x : x[0],reverse=True): #big first
while s:
fn0=s.pop()
s0={fn0}
for fn in s:
if open(fn0,'rb').read() == open(fn,'rb').read(): s0.add(fn)
s -= s0
if len(s0) > 1: res.append(s0)
return res
This function take less than 1 second to scan a directory with 1000 files and find 79 duplicates. Just hashing files costs 10 seconds.
Upvotes: 1
Reputation: 1082
This approach uses a hashing function combined with a dictionary of the file list with a count of the number of times each element appears - a slight expansion on the other approach.
Presumably you're talking about duplicate filenames in different folders, which would mean I would put the initial file_list
together in a slightly different way, but this is the basis for how I would address this issue (depending on what glob.glob
returns)
import hashlib
file_list = []
def test_hash(filename_to_test1, filename_to_test2):
"""
"""
filename_seq = filename_to_test1, filename_to_test2
output = []
for fname in filename_seq:
with open(fname, "rb") as opened_file:
file_data = opened_file.readlines()
file_data_as_string = b"".join(file_data)
_hash = hashlib.sha256()
_hash.update(file_data_as_string)
output.append(_hash.hexdigest())
if output[0] == output[1]:
print "File match"
else:
print "Mismatch between file and reference value"
possible_duplicates = {}
for idx, fname in enumerate(file_list):
if fname in possible_duplicates:
possible_duplicates[fname].append(idx)
elif fname not in possible_duplicates:
possible_duplicates[fname] = [idx]
for fname in possible_duplicates:
if len(possible_duplicates[fname]) > 1:
for idx, list_item in enumerate(possible_duplicates[fname]):
test_hash(possible_duplicates[fname][0], possible_duplicates[fname][idx])
Upvotes: 1
Reputation: 359
Just try to use a hash as suggested. If one pixel changed, the hash also change.
import hashlib
def hash_file(filename):
# use sha1 or sha256 or other hashing algorithm
h = hashlib.sha1()
# open file and read it in chunked
with open(filename,'rb') as file:
chunk = 0
while chunk != b'':
chunk = file.read(1024)
h.update(chunk)
# return string
return h.hexdigest()
https://www.pythoncentral.io/hashing-files-with-python/
It is not influenced by filename or metadata! Put the results in a dataframe than it is easy to get the duplicates
Upvotes: 1