Reputation: 801
I have a problem in my web crawler, where I am trying to retrieve images from a particular website. The problem is that I often see images that are exactly the same but have different URLs.
Is there any Java library or utility that can identify whether two images are exactly the same in content (i.e. at the pixel level)?
My input will be the URLs of the images, from which I can download them.
Upvotes: 11
Views: 9233
Reputation: 1992
I wrote a pure Java library just for this a few days back. You can feed it a directory path (sub-directories are included), and it will list the absolute paths of the duplicate images you may want to delete. Alternatively, you can use it to find all unique images in a directory.
It uses the AWT API internally, so it can't be used on Android. Since ImageIO has problems reading a lot of newer image types, the library uses the TwelveMonkeys jar internally.
https://github.com/srch07/Duplicate-Image-Finder-API
A jar with its dependencies bundled can be downloaded from https://github.com/srch07/Duplicate-Image-Finder-API/blob/master/archives/duplicate_image_finder_1.0.jar
The API can also find duplicates among images of different sizes.
Upvotes: 0
Reputation: 232
You can compare images using:
1) Simple pixel-by-pixel comparison
This does not give good results when the images differ by a shift, rotation, illumination change, etc.
2) Relatively simple but more advanced approach
http://www.lac.inpe.br/JIPCookbook/6050-howto-compareimages.jsp
3) More advanced algorithms
For example, RapidMiner with the IMMI extension contains several image comparison algorithms. You can experiment with the different approaches and select the one that best suits your purpose.
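As a rough illustration of option 1, here is a minimal sketch of an exact pixel-by-pixel comparison. The class and method names (`PixelCompare`, `samePixels`) are made up for this example, and it assumes both images have already been decoded into `BufferedImage` objects.

```java
import java.awt.image.BufferedImage;

final class PixelCompare {
    /** Returns true only if both images have the same size and the same ARGB value at every pixel. */
    static boolean samePixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false; // different dimensions can never match pixel for pixel
        }
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    return false; // stop at the first differing pixel
                }
            }
        }
        return true;
    }
}
```

The images could be obtained with `ImageIO.read(...)`. Comparing every pair this way costs one full scan per pair, so it is probably only practical for a small number of candidates.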
Upvotes: 1
Reputation: 38868
Inspect the response headers and check the value of the HTTP ETag header, if present (RFC 2616: ETag). It may be the same for identical images coming from your target web server. This is because the ETag value is often a message digest such as MD5, which would let you take advantage of computations the web server has already done.
This may even allow you to skip downloading the image!
for each imageUrl in myList
    Perform HTTP HEAD imageUrl
    Pull ETag value from response
    If ETag is in my map of known ETags
        move on to next image
    Else
        Download image
        Store ETag in map
Of course the ETag has to be present and if not, well the idea is toast. But maybe you have pull with the web server admins?
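The pseudocode above could be sketched in Java roughly as follows. `EtagFilter`, `fetchEtag` and `shouldDownload` are hypothetical names, and the sketch assumes the server returns the same ETag for identical images served from different URLs, which is not guaranteed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

final class EtagFilter {
    private final Set<String> seenEtags = new HashSet<>();

    /** Sends a HEAD request and returns the ETag header, or null if the server sent none. */
    static String fetchEtag(String imageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(imageUrl).toURL().openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body
        try {
            return conn.getHeaderField("ETag");
        } finally {
            conn.disconnect();
        }
    }

    /** Returns true if the image should be downloaded: no ETag, or an ETag not seen before. */
    boolean shouldDownload(String etag) {
        if (etag == null) {
            return true; // without an ETag we cannot safely skip the download
        }
        return seenEtags.add(etag); // add() returns false if the ETag was already known
    }
}
```

In a crawler loop this would be used as `if (filter.shouldDownload(EtagFilter.fetchEtag(url))) { ... download ... }`.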
Upvotes: 0
Reputation:
Hashing has already been suggested, and recognizing whether two files are identical is very easy, but you said pixel level. If you want to recognize two images as the same even if they are in different formats (.png/.jpg/.gif/...) and even if they were scaled, I suggest the following (using an image library, and assuming the images are medium or large, not 16x16 icons):
Sum the differences between the corresponding grey pixel values of both images to get a single number; if that number is less than a threshold T, consider the two images identical.
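A minimal sketch of that idea follows. The preparatory steps are assumptions on my part, not taken from the answer: each image is sampled on a fixed `grid` x `grid` lattice (nearest neighbour) so that differently scaled copies line up, and grey levels are computed with the common 0.299/0.587/0.114 luma weights. `GreyDiff` and its methods are hypothetical names, and the threshold has to be tuned by experiment.

```java
import java.awt.image.BufferedImage;

final class GreyDiff {
    /** Grey level (0-255) at grid cell (gx, gy), using nearest-neighbour sampling. */
    static int greyAt(BufferedImage img, int gx, int gy, int grid) {
        int x = gx * img.getWidth() / grid;
        int y = gy * img.getHeight() / grid;
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xff;
        int g = (rgb >> 8) & 0xff;
        int b = rgb & 0xff;
        return (r * 299 + g * 587 + b * 114) / 1000; // integer luma approximation
    }

    /** Sum of absolute grey-level differences over the sampling grid. */
    static long difference(BufferedImage a, BufferedImage b, int grid) {
        long sum = 0;
        for (int gy = 0; gy < grid; gy++) {
            for (int gx = 0; gx < grid; gx++) {
                sum += Math.abs(greyAt(a, gx, gy, grid) - greyAt(b, gx, gy, grid));
            }
        }
        return sum;
    }

    /** Treats the images as identical when the summed difference is below threshold T. */
    static boolean similar(BufferedImage a, BufferedImage b, int grid, long threshold) {
        return difference(a, b, grid) < threshold;
    }
}
```

Nearest-neighbour sampling is crude; averaging each cell, or letting an image library rescale the images first, would likely be more robust against compression noise.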
Upvotes: 1
Reputation: 21795
Look at the MessageDigest class. Essentially, you create an instance of it, then pass it a series of bytes. The bytes could be the bytes directly loaded from the URL if you know that two images that are the "same" will be the selfsame file/stream of bytes. Or if necessary, you could create a BufferedImage from the stream, then pull out pixel values, something like:
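import java.nio.ByteBuffer;
import java.security.MessageDigest;

// bimg is the BufferedImage decoded from the stream
MessageDigest md = MessageDigest.getInstance("MD5");
ByteBuffer bb = ByteBuffer.allocate(4 * bimg.getWidth()); // one row of 4-byte ARGB values
for (int y = bimg.getHeight() - 1; y >= 0; y--) {
    bb.clear(); // reuse the same buffer for every row
    for (int x = bimg.getWidth() - 1; x >= 0; x--) {
        bb.putInt(bimg.getRGB(x, y));
    }
    md.update(bb.array()); // feed this row's pixels into the digest
}
byte[] digBytes = md.digest();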
Either way, MessageDigest.digest() eventually gives you a byte array which is the "signature" of the image. You could convert this to a hex string if it's helpful, e.g. for putting in a HashMap or database table, e.g.:
StringBuilder sb = new StringBuilder();
for (byte b : digBytes) {
    sb.append(String.format("%02X", b & 0xff));
}
String signature = sb.toString();
If the content/image from two URLs gives you the same signature, then they're the same image.
Edit: I forgot to mention that if you are hashing pixel values, you'd probably want to include the dimensions of the image in the hash too. (Just do a similar thing: write two ints to an 8-byte ByteBuffer, then update the MessageDigest with the corresponding 8-byte array.)
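Putting the pieces together, a sketch of a pixel-level signature that also hashes the dimensions might look like this. `PixelSignature.of` is a hypothetical helper name; the sketch hashes the width and height first, then each row of ARGB values, and returns a lowercase hex string.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PixelSignature {
    /** Hex MD5 over the image dimensions followed by every pixel's ARGB value. */
    static String of(BufferedImage img) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
        // Dimensions first, so a 2x3 and a 3x2 image with the same pixel stream differ.
        ByteBuffer dims = ByteBuffer.allocate(8);
        dims.putInt(img.getWidth()).putInt(img.getHeight());
        md.update(dims.array());

        ByteBuffer row = ByteBuffer.allocate(4 * img.getWidth());
        for (int y = 0; y < img.getHeight(); y++) {
            row.clear();
            for (int x = 0; x < img.getWidth(); x++) {
                row.putInt(img.getRGB(x, y));
            }
            md.update(row.array());
        }

        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Without the dimensions, two images whose pixels happen to form the same sequence when read row by row (for example a blank 2x3 and a blank 3x2 image) would get the same signature.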
The other thing somebody mentioned is that MD5 is not collision-resistant. In other words, there is a technique for constructing multiple byte sequences with the same MD5 hash without having to use the "brute force" method of trial and error (where, on average, you'd expect to have to try about 2^64, or roughly 18 billion billion, files before hitting a collision). That makes MD5 unsuitable where you're trying to protect against this threat model. If you're not concerned about somebody deliberately trying to fool your duplicate identification, and you're just worried about the chance of a duplicate hash occurring "by chance", then MD5 is absolutely fine. Actually, it's not only fine, it's a bit over the top: as I say, on average you'd expect one "false duplicate" after about 18 billion billion files. Or put another way, you could have, say, a billion files and the chance of a collision would be extremely close to zero.
If you are worried about the threat model outlined (i.e. you think somebody could be deliberately dedicating processor time to constructing files to fool your system), then the solution is to use a stronger hash. Java supports SHA-1 and SHA-256 out of the box (just replace "MD5" with "SHA-1" or "SHA-256"). These give you longer hashes (160 or 256 bits instead of 128 bits). Practical SHA-1 collisions have since been demonstrated, so SHA-256 is the safer choice here: with current knowledge, finding a SHA-256 collision is infeasible.
Personally for this purpose, I would even consider just using a decent 64-bit hash function. That'll still allow tens of millions of images to be compared with close-to-zero chance of a false positive.
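To put rough numbers on the "by chance" figures above, the standard birthday approximation p = 1 - exp(-n(n-1) / 2^(bits+1)) can be evaluated directly. This is a back-of-the-envelope sketch that assumes the hash behaves like a uniformly random function; `CollisionOdds` is a made-up name.

```java
final class CollisionOdds {
    /**
     * Approximate probability that at least two of n random hashes of the
     * given bit length collide (birthday approximation).
     */
    static double probability(double n, int bits) {
        double exponent = -n * (n - 1) / Math.pow(2, bits + 1);
        return -Math.expm1(exponent); // 1 - e^exponent, accurate for tiny values
    }
}
```

For example, `CollisionOdds.probability(1e7, 64)` comes out at a few in a million, and `CollisionOdds.probability(1e9, 128)` is far smaller still, which is consistent with the estimates in the answer.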
Upvotes: 4
Reputation: 61526
Depending on how detailed you want to get with it:
Regardless of whether you want to do all that, you need to:
No need to rely on any special imaging libraries; images are just bytes.
Upvotes: 7
Reputation: 1376
I've done something very similar to this before in Java, and I found that the PixelGrabber class in the java.awt.image package of the API is extremely helpful (if not downright necessary).
Additionally, you would definitely want to check out the ColorConvertOp class, which performs a pixel-by-pixel color conversion of the data in the source image; the resulting color values are scaled to the precision of the destination image. The documentation goes on to say that the images can even be the same image, in which case it would be quite simple to detect if they are identical.
If you were detecting similarity, you would need to use some form of averaging method, as mentioned in the answer to this question.
If you can, also check out Volume 2, Chapter 7 of Horstmann's Core Java (8th ed.), because there's a whole bunch of examples on image transformations and the like. But again, make sure to poke around the java.awt.image package, because you should find you have almost everything prepared for you :)
G'luck!
Upvotes: 8
Reputation: 77101
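Calculate MD5s of the downloaded image bytes (imageBytes below is assumed to be a byte[] holding the file content) using something like this:
import java.math.BigInteger;
import java.security.MessageDigest;
MessageDigest m = MessageDigest.getInstance("MD5");
m.update(imageBytes, 0, imageBytes.length);
// %032x keeps the leading zeros that BigInteger.toString(16) would drop
System.out.println("MD5: " + String.format("%032x", new BigInteger(1, m.digest())));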
Put them in a hashmap.
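One way the hashmap step might look, as a sketch: the map is keyed by the hex digest and remembers the first URL seen for each digest. `DigestIndex`, `md5Hex` and `register` are hypothetical names, and the image bytes are assumed to have been downloaded already.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class DigestIndex {
    private final Map<String, String> firstUrlByDigest = new HashMap<>();

    /** Lowercase 32-character hex MD5 of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest m = MessageDigest.getInstance("MD5");
            return String.format("%032x", new BigInteger(1, m.digest(data)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    /** Records the image; returns the URL of an earlier identical image, or null if it is new. */
    String register(String url, byte[] imageBytes) {
        return firstUrlByDigest.putIfAbsent(md5Hex(imageBytes), url);
    }
}
```

A non-null return value from `register` means the bytes were seen before under the returned URL, so the new URL can be treated as a duplicate.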
Upvotes: 1
Reputation: 7706
You could also generate an MD5 signature of the file and ignore duplicate entries. It won't help you find similar images, though.
Upvotes: 2
Reputation: 117499
I would think you don't need an image library to do this - simply fetching the URL content and comparing the two streams as byte arrays should do it.
Unless of course you are interested in identifying similar images as well.
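A minimal sketch of the byte-array comparison described above, assuming Java 9 or later for `InputStream.readAllBytes()`. `ByteCompare` and its method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Arrays;

final class ByteCompare {
    /** Downloads the full content behind the URL into memory. */
    static byte[] download(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes();
        }
    }

    /** True if both downloads are byte-for-byte identical. */
    static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }
}
```

This only detects files that are identical byte for byte; the same picture re-encoded or saved in another format will compare as different.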
Upvotes: 1