G. Bach

Reputation: 3909

Identify images with same content in Java

A while ago, I spent some time searching for ways to determine whether two images are identical in order to answer this question. I now face a slightly different problem: I have roughly two thousand images at hand, some of which have the same content, but are scaled/rotated versions of each other (rotations are always by multiples of 90°), along with the problem of different compressions and image formats (mostly jpg, some png, nothing else). The scaling doesn't go beyond roughly 2:1. What I'd like to do is eliminate duplicates while retaining the instance of highest quality. Since Java is the only language in which I'm fairly proficient, I need to use Java.

The answers to a different question offer many useful links, but it doesn't look like any among them can identify duplicates when scaled/rotated.

This question, along with its answers, suggests first scaling all images to a very small size (say 32*32 or 16*16), then hashing the result and comparing images based on the hash. This sounds smart enough to me: the hashes could be sorted before comparison, after which finding equal hashes is a single O(n) pass. However, given that the images may be rotated, I'm not sure how to deal with that; one option would be to manually go through all the images and decide on a rotation, given that what they depict has a clear orientation (the human eye can very easily decide which way "up" should be). If possible, I'd like to avoid that though.
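For illustration, the downscale-then-hash idea could look roughly like the sketch below. It is not taken from the linked answers: the class name `TinyHash`, the 8x8 grid, and the choice of an average hash are assumptions made only to keep the example short, and it does nothing about rotation yet.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper class; the name and the 8x8 grid are illustrative choices.
final class TinyHash {

    /** Scales the image to size x size and returns its gray levels (0..255), row by row. */
    public static int[] shrinkToGray(BufferedImage src, int size) {
        BufferedImage small = new BufferedImage(size, size, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        // Bilinear scaling is a simplification; for very large sources a
        // multi-step or area-averaging downscale would be less prone to aliasing.
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, size, size, null);
        g.dispose();
        int[] gray = new int[size * size];
        small.getRaster().getPixels(0, 0, size, size, gray);
        return gray;
    }

    /** Average hash: one bit per pixel of the 8x8 thumbnail, set if the pixel is brighter than the mean. */
    public static long averageHash(BufferedImage src) {
        int[] gray = shrinkToGray(src, 8);
        long sum = 0;
        for (int v : gray) {
            sum += v;
        }
        double mean = sum / (double) gray.length;
        long hash = 0L;
        for (int i = 0; i < gray.length; i++) {
            if (gray[i] > mean) {
                hash |= 1L << i;
            }
        }
        return hash;
    }

    /** Hamming distance: number of bits in which the two hashes differ. */
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Sorting such hashes only brings images with exactly equal hashes next to each other. Differently compressed copies may differ in a few bits, so in practice a comparison by `distance` against a small threshold (to be tuned on the actual collection) is probably needed as well.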

Are there established methods/algorithms (the links mention SSIM) to deal with this kind of problem, or can any of you come up with better ways than those described above? Maybe someone knows Java libraries that would be well suited to the task (the linked questions mention a Java wrapper for OpenCV, as well as ImageJ and imgscalr)? Any help is appreciated.

Upvotes: 10

Views: 2662

Answers (2)

Abhishek Anand

Reputation: 1992

Well, I think dHash is what you need for this. You just have to extend dHash to take rotation into account, which means the 2000 images will effectively be treated as 8000 images (one hash for each of the four 90° rotations).
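A minimal sketch of that idea follows. It is not the code of the library mentioned below; the class `RotatedDHash` and its method names are hypothetical, and the 9x8 thumbnail is just the usual way to obtain a 64-bit difference hash.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper class, written only to illustrate dHash over four rotations.
final class RotatedDHash {

    /** dHash: shrink to 9x8 gray pixels; each bit says whether a pixel is brighter than its right neighbour. */
    public static long dHash(BufferedImage src) {
        BufferedImage small = new BufferedImage(9, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, 9, 8, null);
        g.dispose();
        int[] gray = new int[9 * 8];
        small.getRaster().getPixels(0, 0, 9, 8, gray);
        long hash = 0L;
        int bit = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++, bit++) {
                if (gray[y * 9 + x] > gray[y * 9 + x + 1]) {
                    hash |= 1L << bit;
                }
            }
        }
        return hash;
    }

    /** Rotates the image by 90 degrees clockwise, copying pixel by pixel. */
    public static BufferedImage rotate90(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
            }
        }
        return dst;
    }

    /** One dHash per orientation: 0, 90, 180 and 270 degrees clockwise. */
    public static long[] hashesForAllRotations(BufferedImage src) {
        long[] hashes = new long[4];
        BufferedImage current = src;
        for (int i = 0; i < 4; i++) {
            hashes[i] = dHash(current);
            current = rotate90(current);
        }
        return hashes;
    }

    /** Smallest Hamming distance between one hash and any of the candidate hashes. */
    public static int minDistance(long hash, long[] candidates) {
        int best = 64;
        for (long c : candidates) {
            best = Math.min(best, Long.bitCount(hash ^ c));
        }
        return best;
    }
}
```

With this, each image would be stored with its four hashes, and two images would count as duplicate candidates when `minDistance(dHash(a), hashesForAllRotations(b))` falls below a threshold that has to be tuned on the actual collection.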

I wrote a pure Java library for exactly this a few days back. You give it a directory path (subdirectories are included), and it returns a list with the absolute paths of the duplicate images that you may want to delete. Alternatively, you can use it to find all unique images in a directory.

It uses the AWT API internally, so it can't be used on Android. Since ImageIO has problems reading a lot of newer image types, it uses the TwelveMonkeys jar internally.

https://github.com/srch07/Duplicate-Image-Finder-API

A jar with all dependencies bundled can be downloaded from https://github.com/srch07/Duplicate-Image-Finder-API/blob/master/archives/duplicate_image_finder_1.0.jar

The API can also find duplicates among images of different sizes.

Upvotes: 0

Andrew Mao

Reputation: 36900

I think that the general answer to this question calls for an unsupervised machine learning approach that generates local invariant features - basically, a fancy way of finding hashes that don't change with scaling or rotation - and then runs a clustering algorithm on them. Here are some papers that might be relevant:

Upvotes: 5
