tomriddle_1234

Reputation: 3213

How to find repeated image pairs (exact duplicates) among lots of losslessly compressed images? How to std::hash them in memory?

My problem is this: I have around 500 images, and there might be 1 or 2 pairs among them that are completely the same, meaning the files' checksums are equal. My eventual goal is to find out which ones are the repeated image pairs.

However, I now have to apply a compression algorithm to these 500 images, because the uncompressed images occupy too much disk space. The compression breaks the checksum, so I cannot use the checksums of the compressed image files to find the repeated image pairs.

Fortunately, my compression algorithm is lossless, which means the restored uncompressed images can still be hashed somehow. But I want to do this in memory, without much disk write access. So my problem is: how do I efficiently pick out repeated images among a large number of image files, in memory?

I use OpenCV often, but any answer is good as long as it is efficient and doesn't save any file to disk. Python/Bash code is also acceptable; C/C++ and OpenCV are preferred.

I can think of using OpenCV's Mat with std::hash, but std::hash won't work directly: I would have to write a std::hash<cv::Mat> specialization, and I don't know how to do that properly yet.
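Something like the following is what I have in mind; a minimal, untested sketch that just hashes the raw pixel bytes (assuming the decoded Mat can be treated as one contiguous byte buffer):

#include "opencv2/core/core.hpp"
#include <functional>
#include <string>

namespace std {
    template <>
    struct hash<cv::Mat> {
        size_t operator()(const cv::Mat& m) const {
            // Compact the rows first if the Mat is not continuous, so
            // the byte view covers exactly the pixel data.
            cv::Mat c = m.isContinuous() ? m : m.clone();
            // Hash the raw bytes through std::hash<std::string>
            // (this copies the buffer, which is wasteful but simple).
            return hash<string>()(string((const char*)c.data,
                                         c.total() * c.elemSize()));
        }
    };
}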

Of course I can do this,

for each pair (img1, img2) among all my images:
    if ((cv::Mat)img1 == (cv::Mat)img2):
        print img1 and img2 are identical

But this is extremely inefficient: it compares every pair of images pixel by pixel, basically an n^4 algorithm.

Note that my problem is not an image similarity problem; it is a hashing problem in memory.

Upvotes: 0

Views: 818

Answers (3)

tomriddle_1234

Reputation: 3213

OK, I worked out a solution myself; better solutions are welcome. I paste the code here.

#include "opencv2/core/core.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <cstdio>
#include <iostream>
#include <string>
#include <cstring>
#include <functional>
#include <openssl/md5.h>

using namespace std;
using namespace cv;

// Computes the MD5 of a memory buffer and returns it as a 32-character
// hex string (heap-allocated; the caller must free() it).
char *str2md5(const char *str, int length) {
    MD5_CTX c;
    unsigned char digest[16];
    char *out = (char*)malloc(33);

    MD5_Init(&c);

    // Feed the buffer to MD5 in 512-byte chunks.
    while (length > 0) {
        int chunk = length > 512 ? 512 : length;
        MD5_Update(&c, str, chunk);
        length -= chunk;
        str += chunk;
    }

    MD5_Final(digest, &c);

    // Render the 16 digest bytes as hex; size 3 = two hex chars + NUL.
    for (int n = 0; n < 16; ++n) {
        snprintf(out + n * 2, 3, "%02x", (unsigned int)digest[n]);
    }

    return out;
}


int main(int argc, const char** argv)
{
    if (argc != 2)
    {
        cerr << "Usage: " << argv[0] << " <image-file>" << endl;
        return EXIT_FAILURE;
    }

    string inputfile = argv[1];

    // -1 means "load the image as-is" (depth and channel count
    // unchanged), so identical files decode to identical pixel buffers.
    Mat src = imread(inputfile, -1);

    if (src.empty())
    {
        cerr << "Could not read image: " << inputfile << endl;
        return EXIT_FAILURE;
    }

    // Hash the whole pixel buffer: step[0] * rows is its size in bytes
    // (imread returns a continuous Mat, so this covers every pixel).
    char *hash = str2md5((char*)src.data, (int)(src.step[0] * src.rows));
    cout << hash << " " << inputfile << endl;
    free(hash);

    return 0;
}

You'll have to install OpenSSL (libssl-dev) on your machine to compile this code. It loads the image into memory and calculates its MD5 value. To find the repeated image pairs, just write a simple bash/python script that runs the compiled program over all the files and searches the resulting MD5 values for duplicates. Note that this MD5 code won't work with huge image files, since the buffer length is passed as an int.
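If you prefer to stay in C++ for that last step too, a tiny companion program along these lines would do (a sketch: it assumes one "md5 filename" line per image on stdin, and that the filenames contain no spaces):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Reads the "<md5> <filename>" lines printed by the program above and
// prints every group of files that share the same digest.
int main()
{
    std::map<std::string, std::vector<std::string> > groups;
    std::string md5, file;
    while (std::cin >> md5 >> file)
        groups[md5].push_back(file);
    for (const auto& g : groups) {
        if (g.second.size() < 2)
            continue;
        std::cout << "identical:";
        for (const auto& f : g.second)
            std::cout << " " << f;
        std::cout << std::endl;
    }
    return 0;
}

Run the hashing program on every image in a shell loop and pipe all of its output into this program.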

Upvotes: 0

usethedeathstar

Reputation: 2309

If it is exact copies of the images you want, you can start by comparing pixel (1,1) of all images and grouping them by the value at pixel (1,1). After that you have groups (hopefully quite a few groups?), and then for each group you compare pixel (1,2). You proceed pixel by pixel like that until you get a hundred groups or so. Then you compare the images in each group in full. That way you still run your slow n^4 algorithm, but each time on groups of five images or so, instead of on 500 images at a time. I am assuming you can read your images pixel by pixel; I know that is possible if they are in .fits, with the pyfits module, but I guess alternatives exist for pretty much any image format.

So the idea behind this is that if pixel (1,1) differs, the entire image differs. This way you can make a list with the values of, say, the first 3 pixels. If there is enough variability in that list, you can do your full 1-to-1 image checks on much smaller groups of images, instead of on 500 images at a time. Does this sound like it should do what you want?
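In OpenCV terms, one round of that grouping could look roughly like this (a sketch; splitByPixel is a made-up name, and it assumes all images are the same size and 8-bit single-channel):

#include "opencv2/core/core.hpp"
#include <map>
#include <utility>
#include <vector>

typedef std::vector<cv::Mat> Group;

// Splits one group of images into sub-groups that agree on the value of a
// single probe pixel. Images in different sub-groups cannot be identical.
std::vector<Group> splitByPixel(const Group& imgs, int row, int col)
{
    std::map<uchar, Group> buckets;
    for (const cv::Mat& m : imgs)
        buckets[m.at<uchar>(row, col)].push_back(m);
    std::vector<Group> out;
    for (auto& b : buckets)
        out.push_back(std::move(b.second));
    return out;
}

Call splitByPixel repeatedly with different probe pixels until the groups are small, and only then run the exhaustive pairwise comparison inside each group.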

Upvotes: 0

Ann Orlova

Reputation: 1348

The idea of a hashing algorithm for images:

  1. Reduce the size of the original image (cvResize()), so that only the important structures remain in the picture (getting rid of the high frequencies). Reduce the image to 8x8; then the total number of pixels is 64, and the hash will suit all kinds of images, regardless of their size and aspect ratio.

  2. Remove the color. Convert the image obtained in the previous step to grayscale (cvCvtColor()). Thus the hash shrinks from 192 values (64 values for each of the three channels: red, green, and blue) to 64 brightness values.

  3. Find the average brightness of the resulting image (cvAvg()).

  4. Binarize the image (cvThreshold()): keep only those pixels that are brighter than the average (treat them as 1, and all others as 0).

  5. Build the hash. Pack the 64 values of 1 and 0 into a single 64-bit hash value.

Next, if you need to compare two images, just build a hash for each of them and count the number of differing bits (using the Hamming distance). The Hamming distance is the number of positions at which the corresponding symbols of two binary words of the same length differ.

A distance of zero means the two images are likely the same, while larger values characterize how much they differ from each other.
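For example, a sketch of these five steps with the C++ OpenCV API (the names above are from the legacy C API) might look like this; it assumes an 8-bit BGR input image:

#include "opencv2/core/core.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include <stdint.h>

// Average hash, following steps 1-5 above (assumes an 8-bit BGR image).
uint64_t averageHash(const cv::Mat& img)
{
    cv::Mat resized, gray;
    cv::resize(img, resized, cv::Size(8, 8));        // 1. shrink to 8x8
    cv::cvtColor(resized, gray, cv::COLOR_BGR2GRAY); // 2. drop the color
    double avg = cv::mean(gray)[0];                  // 3. average brightness
    uint64_t hash = 0;
    for (int r = 0; r < 8; ++r)                      // 4.+5. threshold against
        for (int c = 0; c < 8; ++c)                  // the average and pack
            hash = (hash << 1)                       // the 64 bits
                 | (gray.at<uchar>(r, c) > avg ? 1 : 0);
    return hash;
}

// Hamming distance between two hashes: the number of differing bits.
int hammingDistance(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    int d = 0;
    while (x) { x &= x - 1; ++d; } // clears the lowest set bit each round
    return d;
}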

Upvotes: 1
