user1299518

Hashing an image (series of RGB bytes)

I'm developing an application that involves screen capture and hashing with C/C++. The image I'm capturing is about 250x250 pixels, and I'm using the WinAPI HashData function for hashing.

My goal is to compare 2 hashes (i.e. of 2 images of 250x250) and instantly tell whether they're equal.

My code:

           // 3 bytes (R, G, B) per pixel
           const int PIXEL_SIZE = (sc_part.height * sc_part.width) * 3;
           BYTE* pixels = new BYTE[PIXEL_SIZE];
           for(UINT y = 0, b = 0; y < sc_part.height; y++) {
              for(UINT x = 0; x < sc_part.width; x++) {
                 COLORREF rgb = sc_part.pixels[(y * sc_part.width) + x];
                 pixels[b++] = GetRValue(rgb);
                 pixels[b++] = GetGValue(rgb);
                 pixels[b++] = GetBValue(rgb);
              }
           }
           const int MAX_HASH_LEN = 64;
           BYTE Hash[MAX_HASH_LEN] = {0};
           // HashData is declared in Shlwapi.h (link with Shlwapi.lib)
           // and returns an HRESULT that ought to be checked.
           HashData(pixels, PIXEL_SIZE, Hash, MAX_HASH_LEN);

           // ... I now have my variable-size hash; the example above uses 64 bytes.

           delete[] pixels;

I've tested different hash sizes and their approximate completion times, which were roughly:

           32 bytes  = ~30ms
           64 bytes  = ~47ms
           128 bytes = ~65ms
           256 bytes = ~125ms

My question is:

How long does the hash need to be for a 250x250 image so that two different images practically never produce the same hash?

I'd rather not use a 256-byte hash, since it makes my app run slowly (the captures are very frequent). Is there a "safe" hash size for given image dimensions?

Thanks!

Upvotes: 1

Views: 1459

Answers (2)

Ross Ridge

Reputation: 39581

Assuming, based on your comments, that you're adding each hash calculated "on the fly" to the database, so that the hash of every image in the database ends up being compared with the hash of every other image, then you've run into the birthday paradox. The likelihood that two identical numbers appear in a set of randomly selected numbers (e.g. the birthdays of a group of people) is greater than you'd intuitively assume. If there are 23 people in a room, there's a 50:50 chance that two of them share a birthday.

That means, assuming a good hash function, you can expect a collision (two images having the same hash despite not being identical) after about 2^(N/2) hashes, where N is the number of bits in the hash. If your hash function isn't so good, you can expect a collision even earlier. Unfortunately, only Microsoft knows how good HashData actually is.
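
To put numbers on that bound, here's a minimal, self-contained sketch (plain standard C++, nothing WinAPI-specific; the capture rate is a made-up assumption) that estimates the collision probability for k random N-bit hashes using the usual approximation p ≈ 1 - exp(-k(k-1)/2^(N+1)):

           // Birthday bound: probability of at least one collision among
           // k uniformly random N-bit hashes.
           #include <cmath>
           #include <cstdio>

           double collision_probability(double k, double n_bits) {
               // Work with 2^-N directly so large N doesn't overflow.
               double exponent = -(k * (k - 1.0) / 2.0) * std::exp2(-n_bits);
               return 1.0 - std::exp(exponent);
           }

           int main() {
               // Assumption: one capture every ~50 ms for a year, ~6.3e8 hashes.
               double k = 6.3e8;
               std::printf("64-bit hash:  %g\n", collision_probability(k, 64));
               std::printf("128-bit hash: %g\n", collision_probability(k, 128));
               std::printf("256-bit hash: %g\n", collision_probability(k, 256));
           }

With those assumptions, a 64-bit (8-byte) hash already carries about a 1% chance of a collision over the year, while a 128-bit hash is at roughly 6×10^-22.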

Your comments also bring up a couple of other issues. One is that HashData doesn't produce variable-sized hashes. It produces an array of bytes that's always exactly as long as the hash length you pass in. Your problem is that you're treating that array as a string of characters. In C and C++, strings are zero-terminated, meaning the end of the string is marked with a zero-valued character ('\0'). Since the array of hash bytes will contain zero-valued elements at random positions, it will appear to be truncated when used as a string. Treating the hash as a string like this makes a collision much more likely.
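
A minimal illustration of that truncation bug (the hash values here are made up just to show the mechanics): anything that compares the hash with string functions stops at the first zero byte, while memcmp compares the whole buffer:

           #include <cstring>
           #include <cstdio>

           int main() {
               // Two different 8-byte "hashes" that share a leading zero byte.
               unsigned char a[8] = { 0x4F, 0x00, 0x12, 0x9C, 0x33, 0x71, 0x05, 0xEE };
               unsigned char b[8] = { 0x4F, 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

               // Wrong: strcmp stops at the 0x00, so these compare "equal".
               std::printf("strcmp: %d\n", std::strcmp((const char*)a, (const char*)b));

               // Right: memcmp looks at all 8 bytes and sees the difference.
               std::printf("memcmp: %d\n", std::memcmp(a, b, sizeof a));
           }

Store and compare the hash as raw bytes (e.g. a BLOB column and memcmp), never as a zero-terminated string.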

The other issue is that you said you store the images being compared in your database and that these images must be unique. If that uniqueness is enforced by the database, then checking for it again in your own code is redundant, and the database may well do the check faster than your code can.

Upvotes: 2

Eric Brown

Reputation: 13932

GUIDs (Globally Unique Identifiers) are 16 bytes long, and Microsoft assumes that no GUIDs will ever collide.

Using a 32-byte hash is equivalent to taking two randomly generated GUIDs and comparing them against two other randomly generated GUIDs.

The odds that you will get a collision with a 32-byte hash are vanishingly small: 1/2^256, or about 8.6 × 10^-78.

The universe will reach heat death long before you get a collision.

This comment from Michael Grier more or less encapsulates my beliefs. At a minimum, you should take an image, compute its hash, change the image by one byte, and recompute the hash. A good hash should change far more than one byte of its output (the avalanche effect).
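
Here's a sketch of that test run against HashData itself (assuming the pixels buffer from the question; HashData is declared in Shlwapi.h and needs Shlwapi.lib):

           #include <windows.h>
           #include <shlwapi.h>   // HashData; link with Shlwapi.lib

           // Count how many hash bytes change when one input bit flips.
           int avalanche_bytes(BYTE* pixels, DWORD size, DWORD hashLen) {
               BYTE h1[64] = {0}, h2[64] = {0};   // assumes hashLen <= 64
               HashData(pixels, size, h1, hashLen);

               pixels[0] ^= 0x01;                 // flip one bit of one byte
               HashData(pixels, size, h2, hashLen);
               pixels[0] ^= 0x01;                 // restore the buffer

               int changed = 0;
               for (DWORD i = 0; i < hashLen; i++)
                   if (h1[i] != h2[i]) changed++;
               // A good hash flips about half the output bits, so nearly
               // every output byte should differ between h1 and h2.
               return changed;
           }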

You also need to trade this off against the "birthday effect" (related to the pigeonhole principle): any hash will eventually produce collisions. A quick comparison of the first N bytes, though, will typically reject a false match early.

Cryptographic hashes are typically "better" hashes in the sense that more hash bits change per changed input bit, but they are much slower to compute.
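
For contrast, a well-known fast non-cryptographic hash such as FNV-1a (a sketch below, not something this answer prescribes) digests the ~187,500-byte capture buffer in a fraction of a millisecond on modern hardware, but offers no resistance against deliberately crafted collisions:

           #include <cstdint>
           #include <cstddef>

           // FNV-1a, 64-bit: simple and fast, but NOT collision-resistant
           // against adversarial inputs.
           uint64_t fnv1a64(const unsigned char* data, size_t len) {
               uint64_t h = 14695981039346656037ull;  // FNV offset basis
               for (size_t i = 0; i < len; i++) {
                   h ^= data[i];
                   h *= 1099511628211ull;             // FNV prime
               }
               return h;
           }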

Upvotes: 0
