Reputation: 11

Consistent hashing of a structure to 32 bit

I have a structure having 3 integers in [1, 1000] and string.

I need to represent it in 32 bit number such that two structures differing in at least one field would produce different codes, while structures having the same content consistently produce the same code. Usually one of the integer fields would be increased in several units. This should necessarily produce a different code.

At first, I thought to format the structure fields to a string in a constant format and then to hash it using GetHashCode function of String class. But then I read here in some discussions that repeating process runs on the same input don't necessary produce the same hash output. First of all, is this true in .NET 4? It's important to me, because the hash values should be persisted and stay consistent over process runs. I also saw here suggestions to perform bitwise operations of results of the platform GetHashCode applied on each structure field using prime numbers. But here again, apparently I can't count on the consistent result of the process runs.

If I use cryptographic hash functions, I exceed 32 bit.

If I didn't have a string field, I would compose the code as a 32 bit array from the numeric fields. May be it's worth to XOR such bit array with the string field GetHashCode result? Do I increase chances that repeating run on some input will produce the same hash output?

What would you suggest to do?

Upvotes: 1

Answers (3)

millimoose

Reputation: 39960

Anonymous types have an autogenerated sensible GetHashCode() implementation. I'd try just using:

struct MyStruct 
{
    int _intField1;
    int _intField2;
    int _intField3;
    string _stringField;

    public long GetHashCode() 
    {
        return new { _intField1, _intField2, _intField3, _stringField }.GetHashCode();
    }
}

Since both ints and strings are immutable types, the hash code should remain the same between runs of the application as long the underlying .NET framework version is the same. (This might or might not be "persistent enough".)

That said, it might change should the internal implementation of GetHashCode() change. In that case, use a cryptographic hash. It doesn't matter that it exceeds 32 bits, because cryptographic hashes are designed to produce wildly different output for small changes in input. This means that for two different inputs, any given 32 bits of the hash code are very unlikely to be equal. Just use BitConverter.ToInt32() to convert whichever part of the hash you want to an int.

Also, obviously, this will just make it somewhat unlikely that two different structures will produce different hash codes. (This can be determined using an approximate formula for the birthday paradox, if I'm reading wiki correctly, it means you'd have a 10% chance of getting duplicates once you store ~~~140,000~~ ~30,000 records. Assuming the cryptographic hash has ideal properties. I'm not sure you can do better without a perfect hash.)

Upvotes: 1

Timothy Shields

Reputation: 79511

Serialize your type to a byte[]
Apply common hash algorithm to byte[] to obtain hash byte[]
Pull out, for example, the first 32 bits of the hash byte[] and use that

Upvotes: 0

Cédric Bignon

Reputation: 13022

If you had the following:

struct 
{
    int A;
    int B;
    int C;
}

Assuming A, B, C are in the range [1, 1000]. It is possible to create a "perfect hash" (without collision) since A, B, C can have each 1000 different possible values. Indeed, log2(1000^3) <= 32 (1000^3 is the number of possible values of the sturcture and the log2 is used to get the number of bits required to store all these values without collision and 32 is the number of bits of an integer).

int MyHashCode()
{
    return 1000 * (1000 * (A - 1) + (B - 1)) + (C - 1);  // There is no overflow or collision since A, B, C are in the range [1, 1000]
}

We can simplify it by using a weaker condition: A, B, C are in the range [0, 1000]:

int MyHashCode()
{
    return 1001 * (1001 * A + B) + C;  // There is no overflow or collision since A, B, C are in the range [0, 1000]
}

Update

Given that your structure contains a string inside it. What you want to achieve is impossible. Because the string can represent an infinite number of values.

If it was possible to do it one could create a very powerful compression algorithm. That could store any file into a... 32 bit number! Mathematically, it comes from the fact that an injective function can only map to a bigger space.

Upvotes: 1

Consistent hashing of a structure to 32 bit

Answers (3)

Update

Related Questions