Reputation: 472
The problem is that I've got a folder with more than 80k images, and about 40% of them are duplicates. (Some of the pictures are rotated and some have a different size, but they are still the same image.)
At first I used a hashing algorithm (in C++/Java) to delete all the duplicate images that have the same size and other properties. But it didn't delete all of them, because some pictures have a different size but are visually identical.
I've searched a lot on the net for an efficient algorithm for this problem.
The best code I found for my problem uses pHash, but it's outdated and doesn't work with VS anymore.
If someone has an idea for me, that would be awesome.
Thanks
Upvotes: 0
Views: 840
Reputation: 2354
In addition to the hashing algorithm, you could calculate the histogram of each image and then compare the histograms.
For rotated images the histogram should be the same (exactly so for rotations by multiples of 90 degrees); for resized images it should be very similar.
Here is an example of histogram comparison using OpenCV.
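Independently of OpenCV, the idea can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: images are already decoded to grayscale and stored as lists of rows of values 0-255, and the helper names (`normalized_histogram`, `histogram_intersection`, `rotate90`, `upscale2x`) are made up for this sketch. The histogram is normalized by the pixel count so that a resized copy can be compared with the original, and histogram intersection is just one of several possible similarity measures. In OpenCV the same steps would typically be done with `calcHist` and `compareHist`.

```python
from collections import Counter


def normalized_histogram(pixels, bins=16):
    """Histogram of grayscale values (0-255) as fractions that sum to 1."""
    flat = [v for row in pixels for v in row]
    counts = Counter(v * bins // 256 for v in flat)
    return [counts.get(b, 0) / len(flat) for b in range(bins)]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means the normalized histograms are identical."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rotate90(pixels):
    """Rotate a row-major image by 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]


def upscale2x(pixels):
    """Nearest-neighbour 2x upscale, standing in for a resized copy."""
    out = []
    for row in pixels:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


if __name__ == "__main__":
    image = [[0, 64, 128], [255, 32, 200]]
    original = normalized_histogram(image)
    print(histogram_intersection(original, normalized_histogram(rotate90(image))))
    print(histogram_intersection(original, normalized_histogram(upscale2x(image))))
```

Real resizing interpolates between pixels, so for real photos the similarity will usually come out slightly below 1.0, and you need a threshold rather than an equality test.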
I still suggest using hashing first, because it should be much faster and will remove the first set of duplicates; then refine the result using histogram comparison.
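A rough sketch of that two-stage approach, again in plain Python and under assumptions: `find_duplicates` and its input layout (a dict mapping a file name to its raw bytes plus decoded grayscale pixels) are hypothetical, SHA-256 stands in for whatever exact hash you already use, and the 0.95 threshold is an arbitrary starting value you would have to tune.

```python
import hashlib


def histogram(pixels, bins=16):
    """Normalized grayscale histogram (fractions that sum to 1)."""
    flat = [v for row in pixels for v in row]
    hist = [0] * bins
    for v in flat:
        hist[v * bins // 256] += 1
    return [c / len(flat) for c in hist]


def similarity(h1, h2):
    """Histogram intersection in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def find_duplicates(images, threshold=0.95):
    """images: name -> (file_bytes, grayscale_pixels).

    Returns a list of (duplicate_name, original_name) pairs.
    """
    duplicates = []

    # Stage 1: cheap exact hash removes byte-identical files.
    seen = {}
    survivors = []
    for name, (data, pixels) in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            duplicates.append((name, seen[digest]))
        else:
            seen[digest] = name
            survivors.append((name, histogram(pixels)))

    # Stage 2: compare histograms of the files that survived stage 1.
    kept = []
    for name, hist in survivors:
        match = next((k for k, kh in kept if similarity(hist, kh) >= threshold), None)
        if match is not None:
            duplicates.append((name, match))
        else:
            kept.append((name, hist))
    return duplicates
```

Note that stage 2 as written compares every survivor against every kept image, which is quadratic and may be slow for 80k files. Two different pictures can also have nearly the same histogram, so it is safer to review the stage 2 matches before deleting anything.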
Upvotes: 2