spacitron

Reputation: 2183

Comparing images to find duplicates

I have a few (38000) picture/video files in a folder. Approximately 40% of these are duplicates, which I'm trying to get rid of. My question is: how can I tell if two files are identical? So far I have tried comparing SHA-1 hashes of the files, but it turns out that many duplicate files have different hashes. This is the code I was using:

public static String getHash(File doc) {
    MessageDigest md = null;
    try {
        md = MessageDigest.getInstance("SHA1");
        FileInputStream inStream = new FileInputStream(doc);
        DigestInputStream dis = new DigestInputStream(inStream, md);
        BufferedInputStream bis = new BufferedInputStream(dis);
        while (true) {
            int b = bis.read();
            if (b == -1)
                break;
        }

        inStream.close();
        dis.close();
        bis.close();
    } catch (NoSuchAlgorithmException | IOException e) {
        e.printStackTrace();
    }

    BigInteger bi = new BigInteger(md.digest());

    return bi.toString(16);
}

Can I modify this in any way? Or will I have to use a different method?

Upvotes: 9

Views: 14822

Answers (7)

Farruh Habibullaev

Reputation: 2392

This question was asked a long time ago, but I have found the following link very useful; it has code for many languages: https://rosettacode.org/wiki/Percentage_difference_between_images#Kotlin

Here is the Kotlin code taken from the link:

import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO
import kotlin.math.abs

fun getDifferencePercent(img1: BufferedImage, img2: BufferedImage): Double {
    val width = img1.width
    val height = img1.height
    val width2 = img2.width
    val height2 = img2.height
    if (width != width2 || height != height2) {
        val f = "(%d,%d) vs. (%d,%d)".format(width, height, width2, height2)
        throw IllegalArgumentException("Images must have the same dimensions: $f")
    }
    var diff = 0L
    for (y in 0 until height) {
        for (x in 0 until width) {
            diff += pixelDiff(img1.getRGB(x, y), img2.getRGB(x, y))
        }
    }
    val maxDiff = 3L * 255 * width * height
    return 100.0 * diff / maxDiff
}

fun pixelDiff(rgb1: Int, rgb2: Int): Int {
    val r1 = (rgb1 shr 16) and 0xff
    val g1 = (rgb1 shr 8)  and 0xff
    val b1 =  rgb1         and 0xff
    val r2 = (rgb2 shr 16) and 0xff
    val g2 = (rgb2 shr 8)  and 0xff
    val b2 =  rgb2         and 0xff
    return abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
}

fun main(args: Array<String>) {
    val img1 = ImageIO.read(File("Lenna50.jpg"))
    val img2 = ImageIO.read(File("Lenna100.jpg"))

    val p = getDifferencePercent(img1, img2)
    println("The percentage difference is ${"%.6f".format(p)}%")
}

Upvotes: 0

Android Geek

Reputation: 636

You can compute the percentage difference between two images with the method below. If the difference is below 10%, you can treat the images as identical:

 private static double getDifferencePercent(BufferedImage img1, BufferedImage img2) {
    int width = img1.getWidth();
    int height = img1.getHeight();
    int width2 = img2.getWidth();
    int height2 = img2.getHeight();
    if (width != width2 || height != height2) {
        throw new IllegalArgumentException(String.format("Images must have the same dimensions: (%d,%d) vs. (%d,%d)", width, height, width2, height2));
    }

    long diff = 0;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            diff += pixelDiff(img1.getRGB(x, y), img2.getRGB(x, y));
        }
    }
    long maxDiff = 3L * 255 * width * height;

    return 100.0 * diff / maxDiff;
}

private static int pixelDiff(int rgb1, int rgb2) {
    int r1 = (rgb1 >> 16) & 0xff;
    int g1 = (rgb1 >>  8) & 0xff;
    int b1 =  rgb1        & 0xff;
    int r2 = (rgb2 >> 16) & 0xff;
    int g2 = (rgb2 >>  8) & 0xff;
    int b2 =  rgb2        & 0xff;
    return Math.abs(r1 - r2) + Math.abs(g1 - g2) + Math.abs(b1 - b2);
}
// Convert an Image to a BufferedImage with this method

public static BufferedImage toBufferedImage(Image img)
{
    if (img instanceof BufferedImage)
    {
        return (BufferedImage) img;
    }

    // Create a buffered image with transparency
    BufferedImage bimage = new BufferedImage(img.getWidth(null), img.getHeight(null), BufferedImage.TYPE_INT_ARGB);

    // Draw the image on to the buffered image
    Graphics2D bGr = bimage.createGraphics();
    bGr.drawImage(img, 0, 0, null);
    bGr.dispose();

    // Return the buffered image
    return bimage;
}

The idea comes from this site: https://rosettacode.org/wiki/Percentage_difference_between_images#Kotlin

Upvotes: 0

spacitron

Reputation: 2183

It's been a long time, so I should probably explain how I finally solved my problem. The real trick was to not use hashes to begin with and instead just compare the timestamps in the EXIF data. Given that these pictures were taken either by me or my wife, it would have been quite unlikely for different files to have the same timestamp, hence this simpler solution was actually much more reliable.
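A minimal sketch of that grouping step, under stated assumptions: `timestampOf` is a hypothetical, caller-supplied function that returns the EXIF capture time as a string (reading EXIF would probably need a third-party metadata library, which is not shown here), and `TimestampGrouper` is just an illustrative name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TimestampGrouper {

    /**
     * Groups files by a key (for example the EXIF capture timestamp) and
     * returns only the groups with more than one member, i.e. the likely
     * duplicates. timestampOf is a hypothetical extractor supplied by the caller.
     */
    public static <T> Map<String, List<T>> duplicateGroups(
            List<T> files, Function<T, String> timestampOf) {
        Map<String, List<T>> groups = new LinkedHashMap<>();
        for (T file : files) {
            String key = timestampOf.apply(file);
            if (key == null) {
                continue; // no timestamp available: cannot decide, so skip
            }
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(file);
        }
        // Keep only timestamps shared by at least two files.
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }
}
```

Each returned group would then hold one original plus its probable copies; which one to keep is still a manual or policy decision.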

Upvotes: 0

dangt85

Reputation: 73

Besides using a hash, if your duplicates have different sizes (because they were resized), you could compare pixel by pixel (maybe not the entire image but a sub-section of it).

This may depend on the image format, but you could first compare the height and width and then go pixel by pixel using the RGB values. To make it more efficient, you can choose a comparison threshold. For example:

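import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class Main {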
    public static void main(String[] args) throws IOException {
        ImageChecker i = new ImageChecker();
        BufferedImage one = ImageIO.read(new File("D:/Images/460249177.jpg"));
        BufferedImage two = ImageIO.read(new File("D:/Images/460249177a.jpg"));
        if(one.getWidth() + one.getHeight() >= two.getWidth() + two.getHeight()) {
            i.setOne(one);
            i.setTwo(two);
        } else {
            i.setOne(two);
            i.setTwo(one);
        }
        System.out.println(i.compareImages());
    }
}

public class ImageChecker {

    private BufferedImage one;
    private BufferedImage two;
    private double difference = 0;
    private int x = 0;
    private int y = 0;

    public ImageChecker() {

    }

    public boolean compareImages() {
        int f = 20;
        int w1 = Math.min(50, one.getWidth() - two.getWidth());
        int h1 = Math.min(50, one.getHeight() - two.getHeight());
        int w2 = Math.min(5, one.getWidth() - two.getWidth());
        int h2 = Math.min(5, one.getHeight() - two.getHeight());
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }

        one = one.getSubimage(Math.max(0, x - w1), Math.max(0, y - h1),
                Math.min(two.getWidth() + w1, one.getWidth() - x + w1),
                Math.min(two.getHeight() + h1, one.getHeight() - y + h1));
        x = 0;
        y = 0;
        difference = 0;
        f = 5;
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }
        one = one.getSubimage(Math.max(0, x - w2), Math.max(0, y - h2),
                Math.min(two.getWidth() + w2, one.getWidth() - x + w2),
                Math.min(two.getHeight() + h2, one.getHeight() - y + h2));
        f = 1;
        for (int i = 0; i <= one.getWidth() - two.getWidth(); i += f) {
            for (int j = 0; j <= one.getHeight() - two.getHeight(); j += f) {
                compareSubset(i, j, f);
            }
        }
        System.out.println(difference);
        return difference < 0.1;
    }

    public void compareSubset(int a, int b, int f) {
        double diff = 0;
        for (int i = 0; i < two.getWidth(); i += f) {
            for (int j = 0; j < two.getHeight(); j += f) {
                int onepx = one.getRGB(i + a, j + b);
                int twopx = two.getRGB(i, j);
                // Mask every channel so the alpha byte does not leak into red
                int r1 = (onepx >> 16) & 0xff;
                int g1 = (onepx >> 8) & 0xff;
                int b1 = (onepx) & 0xff;
                int r2 = (twopx >> 16) & 0xff;
                int g2 = (twopx >> 8) & 0xff;
                int b2 = (twopx) & 0xff;
                diff += (Math.abs(r1 - r2) + Math.abs(g1 - g2) + Math.abs(b1
                        - b2)) / 3.0 / 255.0;
            }
        }
        double percentDiff = diff * f * f / (two.getWidth() * two.getHeight());
        if (percentDiff < difference || difference == 0) {
            difference = percentDiff;
            x = a;
            y = b;
        }
    }

    public BufferedImage getOne() {
        return one;
    }

    public void setOne(BufferedImage one) {
        this.one = one;
    }

    public BufferedImage getTwo() {
        return two;
    }

    public void setTwo(BufferedImage two) {
        this.two = two;
    }
}

Upvotes: 4

Abhishek Anand

Reputation: 1992

You need a perceptual hashing algorithm for this: aHash, pHash, or dHash, which offers the best of both.

I wrote a pure Java library for exactly this a few days ago. You can give it a directory path (subdirectories are included), and it will list the duplicate images that you may want to delete, with their absolute paths. Alternatively, you can use it to find all unique images in a directory.

It uses the AWT API internally, so it can't be used on Android. Since ImageIO has problems reading a lot of newer image types, it uses the TwelveMonkeys jar internally.

https://github.com/srch07/Duplicate-Image-Finder-API

A jar with the dependencies bundled can be downloaded from https://github.com/srch07/Duplicate-Image-Finder-API/blob/master/archives/duplicate_image_finder_1.0.jar

The API can also find duplicates among images of different sizes.
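To give an idea of what a difference hash does, here is a minimal sketch of the commonly described dHash scheme (shrink to 9x8, compare horizontally adjacent pixels). It is an illustration only and not the code of the library linked above; the class name `DHash` and the luma weights are assumptions.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class DHash {

    /** Computes a 64-bit difference hash of the image. */
    public static long dHash(BufferedImage src) {
        // Shrink to 9x8 so each row yields 8 left/right comparisons.
        BufferedImage small = new BufferedImage(9, 8, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = small.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, 9, 8, null);
        g.dispose();

        long hash = 0L;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                // Bit is 1 when the left pixel is brighter than its right neighbour.
                boolean brighter = luma(small.getRGB(x, y)) > luma(small.getRGB(x + 1, y));
                hash = (hash << 1) | (brighter ? 1L : 0L);
            }
        }
        return hash;
    }

    /** Integer approximation of perceived brightness (weights are an assumption). */
    private static int luma(int rgb) {
        int r = (rgb >> 16) & 0xff;
        int g = (rgb >> 8) & 0xff;
        int b = rgb & 0xff;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Number of differing bits; small values suggest visually similar images. */
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Two images would be treated as probable duplicates when the Hamming distance between their hashes is below some small cutoff; the right cutoff depends on the collection and has to be tuned.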

Upvotes: 2

Mathias

Reputation: 324

As outlined above, duplicate detection can be based on a hash. However, if you want near-duplicate detection, meaning that you are searching for images that basically show the same thing but have been scaled, rotated, etc., you might need a content-based image retrieval approach. There's LIRE (https://code.google.com/p/lire/), a Java library for that, and you'll find the "SimpleApplication" in the Download section. What you can then do is:

  1. Index the first image
  2. Go to the next image I
  3. Search for I in the index
  4. If there are results with a score below a threshold, then mark them as duplicates
  5. Index I
  6. Go to (2)

Students of mine did this and it worked well, but I don't have the source code at hand. Rest assured, though: it's just a few lines, and the simple application will get you started.
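The loop in steps 1 to 6 can be sketched independently of LIRE. This is not LIRE's API: the "index" here is a plain in-memory list, `distance` is a hypothetical caller-supplied scoring function (lower means more similar), and the sketch marks the incoming image as the duplicate rather than the indexed match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

public class IncrementalDedup {

    /**
     * Walks the items once, searching each one against everything indexed so
     * far. An item whose distance to any indexed item is below the threshold
     * is reported as a duplicate. Every item is indexed afterwards (step 5).
     */
    public static <T> List<T> findDuplicates(
            List<T> items, ToDoubleBiFunction<T, T> distance, double threshold) {
        List<T> index = new ArrayList<>();      // stand-in for a real search index
        List<T> duplicates = new ArrayList<>();
        for (T item : items) {
            boolean isDuplicate = false;
            for (T seen : index) {              // linear scan: fine for a sketch only
                if (distance.applyAsDouble(item, seen) < threshold) {
                    isDuplicate = true;
                    break;
                }
            }
            if (isDuplicate) {
                duplicates.add(item);
            }
            index.add(item);
        }
        return duplicates;
    }
}
```

A real index avoids the linear scan, which matters at 38000 files; the control flow stays the same.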

Upvotes: 6

MvG

Reputation: 61077

You could convert your files, e.g. with ImageMagick's convert, to a format which has a canonical representation and as little metadata as possible. I guess I'd use PNM. So try something like this:

convert input.png pnm:- | md5sum -

If this does yield the same result for two files which compared different before, then metadata is in fact the source of your problem, and you can either use some command line approach like this, or update your code to read the image and compute the hash from the raw uncompressed data.

If, on the other hand, different files still compare different, then you have some changes to the actual image data. One possible cause might be the addition or removal of an alpha channel, particularly if you are dealing with PNG here. With JPEG, on the other hand, you'll likely have images uncompressed and then recompressed again, which will lead to slight modifications and data loss. JPEG is an inherently lossy codec, and any two images will likely differ unless they were created using the same application (or library), with the same settings and from the same input data. In that case you'll need to perform fuzzy image matching. Tools like Geeqie can do this. If you want to do this yourself, you'll have a lot of work ahead of you, and should do some research up front.
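The option of hashing the raw uncompressed data from your own code could look roughly like the sketch below. It assumes the image has already been decoded into a `BufferedImage` (for example with `ImageIO.read`); the class and method names are illustrative. It only helps in the metadata case, since recompressed JPEGs will still decode to different pixels.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PixelHash {

    /** Returns a SHA-1 hex digest of the decoded pixels, ignoring file metadata. */
    public static String hashPixels(BufferedImage img) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
        int w = img.getWidth();
        int h = img.getHeight();
        // Include the dimensions so that e.g. 2x8 and 4x4 images cannot collide.
        md.update(ByteBuffer.allocate(8).putInt(w).putInt(h).array());

        int[] row = new int[w];
        ByteBuffer buf = ByteBuffer.allocate(w * 4);
        for (int y = 0; y < h; y++) {
            img.getRGB(0, y, w, 1, row, 0, w); // one row of ARGB pixels
            buf.clear();
            buf.asIntBuffer().put(row);        // writes into buf's backing array
            md.update(buf.array());
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff)); // keeps leading zeros
        }
        return hex.toString();
    }
}
```

Formatting each byte separately keeps leading zeros and avoids the sign issue of converting the digest through a signed `BigInteger`.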

Upvotes: 1
