user1299518

Hashing an image (series of RGB bytes)

I'm developing an application that involves screen capture and hashing with C/C++. The image I'm capturing is about 250x250 pixels, and I'm using the WinAPI HashData function for hashing.

My goal is to compare 2 hashes (i.e. of 2 images of 250x250) and instantly tell whether they're equal.

My code:

           // 3 bytes (R, G, B) per pixel
           const int PIXEL_SIZE = (sc_part.height * sc_part.width) * 3;
           BYTE* pixels = new BYTE[PIXEL_SIZE];
           for(UINT y = 0, b = 0; y < sc_part.height; y++) {
              for(UINT x = 0; x < sc_part.width; x++) {
                 COLORREF rgb = sc_part.pixels[(y * sc_part.width) + x];
                 pixels[b++] = GetRValue(rgb);
                 pixels[b++] = GetGValue(rgb);
                 pixels[b++] = GetBValue(rgb);
              }
           }
           const int MAX_HASH_LEN = 64;
           BYTE Hash[MAX_HASH_LEN] = {0};
           // HashData is declared in Shlwapi.h (link with Shlwapi.lib)
           // and returns an HRESULT that ought to be checked.
           HashData(pixels, PIXEL_SIZE, Hash, MAX_HASH_LEN);

           // ... I now have my variable-size hash; the example above uses 64 bytes.

           delete[] pixels;

I've tested different hash sizes and their approximate completion times, which were roughly:

           32 bytes  = ~30ms
           64 bytes  = ~47ms
           128 bytes = ~65ms
           256 bytes = ~125ms

My question is:

How long does the hash need to be for a 250x250 image so that two different images practically never produce the same hash?

I'd rather not use a 256-byte hash, since it makes my app run slowly (the captures are very frequent). Is there a "safe" hash size for given image dimensions?

Thanks!

Upvotes: 1

Views: 1459

Answers (2)

Ross Ridge

Reputation: 39581

Assuming, based on your comments, that you're adding each hash calculated "on the fly" to the database, so that the hash of every image in the database ends up being compared with the hash of every other image, then you've run into the birthday paradox. The likelihood that two identical numbers appear in a set of randomly selected numbers (e.g. the birthdays of a group of people) is greater than you'd intuitively assume. If there are 23 people in a room, there's a 50:50 chance that two of them share a birthday.

That means, assuming a good hash function, you can expect a collision (two images having the same hash despite not being identical) after about 2^(N/2) hashes, where N is the number of bits in the hash. If your hash function isn't so good, you can expect a collision even earlier. Unfortunately, only Microsoft knows how good HashData actually is.
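
To put numbers on that bound, here's a minimal, self-contained sketch (plain standard C++, nothing WinAPI-specific; the capture rate is a made-up assumption) that estimates the collision probability for k random N-bit hashes using the usual approximation p ≈ 1 - exp(-k(k-1)/2^(N+1)):

           // Birthday bound: probability of at least one collision among
           // k uniformly random N-bit hashes.
           #include <cmath>
           #include <cstdio>

           double collision_probability(double k, double n_bits) {
               // Work with 2^-N directly so large N doesn't overflow.
               double exponent = -(k * (k - 1.0) / 2.0) * std::exp2(-n_bits);
               return 1.0 - std::exp(exponent);
           }

           int main() {
               // Assumption: one capture every ~50 ms for a year, ~6.3e8 hashes.
               double k = 6.3e8;
               std::printf("64-bit hash:  %g\n", collision_probability(k, 64));
               std::printf("128-bit hash: %g\n", collision_probability(k, 128));
               std::printf("256-bit hash: %g\n", collision_probability(k, 256));
           }

With those assumptions, a 64-bit (8-byte) hash already carries about a 1% chance of a collision over the year, while a 128-bit hash is at roughly 6×10^-22.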

Your comments also bring up a couple of other issues. One is that HashData doesn't produce variable-sized hashes. It produces an array of bytes that's always exactly as long as the hash length you pass in. Your problem is that you're treating that array as a string of characters. In C and C++, strings are zero-terminated, meaning the end of the string is marked with a zero-valued character ('\0'). Since the array of hash bytes will contain zero-valued elements at random positions, it will appear to be truncated when used as a string. Treating the hash as a string like this makes a collision much more likely.
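
A minimal illustration of that truncation bug (the hash values here are made up just to show the mechanics): anything that compares the hash with string functions stops at the first zero byte, while memcmp compares the whole buffer:

           #include <cstring>
           #include <cstdio>

           int main() {
               // Two different 8-byte "hashes" that share a leading zero byte.
               unsigned char a[8] = { 0x4F, 0x00, 0x12, 0x9C, 0x33, 0x71, 0x05, 0xEE };
               unsigned char b[8] = { 0x4F, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

               // Wrong: strcmp stops at the 0x00, so these compare "equal".
               std::printf("strcmp: %d\n", std::strcmp((const char*)a, (const char*)b));

               // Right: memcmp looks at all 8 bytes and sees the difference.
               std::printf("memcmp: %d\n", std::memcmp(a, b, sizeof a));
           }

Store and compare the hash as raw bytes (e.g. a BLOB column and memcmp), never as a zero-terminated string.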

The other issue is that you said you store the images being compared in your database and that these images must be unique. If that uniqueness is enforced by the database, then checking for it again in your own code is redundant, and the database may well do the check faster than your code can.

Upvotes: 2

Eric Brown

Reputation: 13932

GUIDs (Globally Unique Identifiers) are 16 bytes long, and Microsoft assumes that no GUIDs will ever collide.

Using a 32-byte hash is equivalent to taking two randomly generated GUIDs and comparing them against two other randomly generated GUIDs.

The odds that you will get a collision with a 32-byte hash are vanishingly small: 1/2^256, or about 8.6 × 10^-78.

The universe will reach heat death long before you get a collision.

This comment from Michael Grier more or less encapsulates my beliefs. At a minimum, you should take an image, compute its hash, change the image by one byte, and recompute the hash. A good hash should change far more than one byte of its output (the avalanche effect).
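
Here's a sketch of that test run against HashData itself (assuming the pixels buffer from the question; HashData is declared in Shlwapi.h and needs Shlwapi.lib):

           #include <windows.h>
           #include <shlwapi.h>   // HashData; link with Shlwapi.lib

           // Count how many hash bytes change when one input bit flips.
           int avalanche_bytes(BYTE* pixels, DWORD size, DWORD hashLen) {
               BYTE h1[64] = {0}, h2[64] = {0};   // assumes hashLen <= 64
               HashData(pixels, size, h1, hashLen);

               pixels[0] ^= 0x01;                 // flip one bit of one byte
               HashData(pixels, size, h2, hashLen);
               pixels[0] ^= 0x01;                 // restore the buffer

               int changed = 0;
               for (DWORD i = 0; i < hashLen; i++)
                   if (h1[i] != h2[i]) changed++;
               // A good hash flips about half the output bits, so nearly
               // every output byte should differ between h1 and h2.
               return changed;
           }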

You also need to trade this off against the "birthday effect" (related to the pigeonhole principle): any hash will eventually produce collisions. A quick comparison of the first N bytes, though, will typically reject a false match early.

Cryptographic hashes are typically "better" hashes in the sense that more hash bits change per changed input bit, but they are much slower to compute.
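
For contrast, a well-known fast non-cryptographic hash such as FNV-1a (a sketch below, not something this answer prescribes) digests the ~187,500-byte capture buffer in a fraction of a millisecond on modern hardware, but offers no resistance against deliberately crafted collisions:

           #include <cstdint>
           #include <cstddef>

           // FNV-1a, 64-bit: simple and fast, but NOT collision-resistant
           // against adversarial inputs.
           uint64_t fnv1a64(const unsigned char* data, size_t len) {
               uint64_t h = 14695981039346656037ull;  // FNV offset basis
               for (size_t i = 0; i < len; i++) {
                   h ^= data[i];
                   h *= 1099511628211ull;             // FNV prime
               }
               return h;
           }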

Upvotes: 0
