Reputation: 631
I have 2 image folders containing 10k and 35k images. Each image is approximately (2k, 2k) in size.
I want to remove the images that are exact duplicates.
The variation between different images is just a change in some pixels.
I have tried dHash, pHash and aHash, but as they are lossy image hashing techniques, they give the same hash for non-duplicate images too.
I also tried writing code in Python that simply subtracts images; the pairs for which the resulting array is zero everywhere are duplicates of each other.
But the time for a single combination is 0.29 seconds, and for a total of 350 million combinations that is really huge.
Is there a way to do it faster without also flagging non-duplicate images?
I am open to doing it in any language (C, C++) or any approach (distributed computing, multithreading) that can solve my problem accurately.
Apologies if I mentioned some irrelevant approaches; I am not from a computer science background.
Below is the code I used for the Python approach:
import os
import timeit

import numpy as np
from skimage import io

# path1 and path2 are assumed to be lists of image file paths for the two folders

start = timeit.default_timer()

duplicates = {}  # renamed from `dict`, which shadows the built-in type
for i in path1:
    img1 = io.imread(i)
    base1 = os.path.basename(i)
    for j in path2:
        img2 = io.imread(j)
        base2 = os.path.basename(j)
        # np.array_equal is already an exact pixel-wise comparison;
        # the subtraction below only re-confirms the match
        if np.array_equal(img1, img2):
            err = img1.astype('float') - img2.astype('float')
            is_all_zero = np.all(err == 0)
            if is_all_zero:
                duplicates[base1] = base2

stop = timeit.default_timer()
print('Time: ', stop - start)
Upvotes: 0
Views: 3154
Reputation: 1
This code checks if there are any duplicates in a folder (it's a bit slow though):
import os
import time

import cv2
from image_similarity_measures.quality_metrics import rmse


def check(path_original, path_new):  # pass raw strings (r"...") for Windows paths
    original = cv2.imread(path_original)
    new = cv2.imread(path_new)
    return rmse(original, new)


def folder_check(folder_path):
    i = 0
    file_list = os.listdir(folder_path)
    print(file_list)
    duplicate_dict = {}
    matched = set()  # track found duplicates instead of mutating file_list while iterating
    for file in file_list:
        if file in matched:
            continue
        file_path = os.path.join(folder_path, file)
        for file_compare in file_list:
            print(i)
            i += 1
            if file_compare == file or file_compare in matched:
                continue
            file_compare_path = os.path.join(folder_path, file_compare)
            similarity_score = check(file_path, file_compare_path)
            if similarity_score == 0.0:  # an RMSE of 0 means the images are pixel-identical
                print(file, file_compare)
                duplicate_dict[file] = file_compare
                matched.add(file_compare)
    return duplicate_dict


start_time = time.time()
print(folder_check(r"C:\Users\Admin\Linear-Regression-1\image-similarity-measures\input1"))
end_time = time.time()
stamp = end_time - start_time
print(stamp)
Upvotes: 0
Reputation:
Use lossy hashing as a prefiltering step, before a complete comparison. You can also generate thumbnail images (say 12 x 8 pixels), and compare for similarity.
The idea is to perform quick rejection of very different images.
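As a rough illustration, here is a minimal sketch of that two-stage approach, assuming Pillow and NumPy are available; the folder names and the 12 x 8 thumbnail signature are placeholders, not a prescribed implementation.

import os
from collections import defaultdict

import numpy as np
from PIL import Image

FOLDER1 = "folder_10k"   # hypothetical paths to the two image folders
FOLDER2 = "folder_35k"

def lossy_key(path, size=(12, 8)):
    """Tiny grayscale thumbnail used only for quick rejection, not as proof of equality."""
    with Image.open(path) as im:
        return im.convert("L").resize(size).tobytes()

# Bucket the first folder by the lossy key; exact duplicates always share a key.
buckets = defaultdict(list)
for name in os.listdir(FOLDER1):
    path = os.path.join(FOLDER1, name)
    buckets[lossy_key(path)].append(path)

# Only images from the second folder that land in a non-empty bucket
# reach the expensive pixel-by-pixel comparison.
duplicates = []
for name in os.listdir(FOLDER2):
    path = os.path.join(FOLDER2, name)
    for candidate in buckets.get(lossy_key(path), []):
        img_a = np.asarray(Image.open(candidate))
        img_b = np.asarray(Image.open(path))
        if img_a.shape == img_b.shape and np.array_equal(img_a, img_b):
            duplicates.append((candidate, path))

print(duplicates)

Since exact duplicates always produce identical thumbnails, the prefilter cannot miss a true duplicate; it only has to verify the rare thumbnail collisions, so the vast majority of the 350 million pairs are rejected cheaply.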
Upvotes: 2
Reputation: 1293
You should look for an answer on how to delete duplicate files (not only images). Then you can use, for example, fdupes,
or find some alternative software: https://alternativeto.net/software/fdupes/
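If a pure-Python route is preferred, the same idea (byte-level duplicate detection, which is roughly what fdupes does) can be sketched with the standard library alone; the folder name below is a placeholder. Note that this only catches files that are identical byte-for-byte, not images re-encoded from the same pixels.

import hashlib
import os
from collections import defaultdict

FOLDER = "images"  # hypothetical folder to scan

def file_digest(path, chunk_size=1 << 20):
    """Hash the raw file bytes; identical files always get identical digests."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Group files by content digest; any group with more than one entry
# is a set of byte-identical files.
groups = defaultdict(list)
for name in os.listdir(FOLDER):
    path = os.path.join(FOLDER, name)
    if os.path.isfile(path):
        groups[file_digest(path)].append(name)

for digest, names in groups.items():
    if len(names) > 1:
        print(names)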
Upvotes: 1