Reputation: 31106

Good hash function for unordered set of ID's

I'm having the following problem:

Given an unordered, arbitrary large set of id's (e.g. 1,2,5,6,8 in a 32-bit space)
Calculate a hash code in a larger space (e.g. 64-bit).

The easy was is to calculate a hash function for each individual ID and then XOR everything together. However, if you have a 32-bit space for ID's and a 64-bit space for the hash function, that might not be the best way to approach this (collisions and so on...).

I've been thinking about using the Murmur3 finalizer and then XOR'ing the results together, but I guess that won't work either for the same reason (I'm not sure to be honest). Similarly, simply multiplying values should also work (because ab = ba), but I'm not sure how 'good' that hash function will be.

Obviously, sorting the ID's comes to mind, after which Murmur3 will do nicely. Still, I don't want to sort if I can avoid it.

What's a good algorithm for such a hash function?

update

OK, I guess I might have been a bit confusing.

The second answer on Why is XOR the default way to combine hashes? actually explains about combining hash functions. In the case presented there, XOR is argued to be a bad hash function, because "dab" produces the same code as "abd". In my case, I want these things to generate the same hash value - but I also want to minimize the chance that -say- "abc" also generates the same hash value as -say- "abd".

The whole purpose of most hash functions is that they have a high probability to use the complete key space if you feed them data. In general, these hash functions exploit the fact that data is sequential, and multiply by a large number to shuffle bits around. So in simple terms:

var hash = SomeInitialConstant;
foreach (var id in ids) {
  hash = hash * SomeConstant + hashCode(id);
}
// ... optionally shuffle bits around as finalizer
return hash;

Now, this works fine if the ID's are always in the same order. However, if ID's are unordered, this won't work, because x * constant + y is not commutative.

If you square the ID's, I don't think that you will end up using the entire hash space. Consider what happens if you have large numbers, say 100000, 100001, etc. The quares of those are 10000000000, 10000200001, etc. There's no way you'll get a square ever to generate a number like 900000 (simply because sqrt(900000) is a number with a fraction).

In more general terms, it's likely that all the hash space between 10000000000 and 10000200001 will get lost. However, the space between -say- 0 and 10 will get a lot of collisions, because the available hash space between the squares of small numbers if small as well.

The whole purpose of using a large key space is obviously having few collisions. I would like to have a pretty large hash space (say, 256 bits) to ensure collisions are virtually non-existent in real-life scenario's.

Upvotes: 2

Answers (3)

joop

Reputation: 4503

I just checked :

using 32 bit hashes
in a 64K table of arrays
with 64K items (load-factor = 100%)
of 8 bit values (unsigned characters )
(array size 4...64)
hash function := cnt+ (sum cube (arr[i]))
or := sum(square (zobrist[arr[i]))
Zobrist works a bit better, (but array needs fursther randomisation)
and collisions were not more than expected for an optimal hash function.
to avoid recomputation (space-time tradeoff), I actually store the hash values within the objects
since collisions are a fact of life, you could postpone the sorting to the moment when you actually need it for the final comparison (when the chain length start to grow above 1)

#include <stdio.h>
#include <stdlib.h>

struct list {
        struct list *next;
        unsigned hash;
        unsigned short cnt;
        unsigned char *data;
        };

struct list *hashtab[1<<16] = {NULL, };
#define COUNTOF(a) (sizeof a / sizeof a[0])
unsigned zobrist[256] = {0,};
/*************************/
unsigned hash_it(unsigned char *cp, unsigned cnt)
{
unsigned idx;
unsigned long long hash = 0;

for(idx=0; idx < cnt; idx++) {
#if 0   /* cube */
        hash += (cp[idx] * cp[idx] * cp[idx]);
#else
        unsigned val;
        val = zobrist[cp[idx]];
        hash += (val * val);
#endif
        }
#if 0   /* as a tie-breaker: add the count (this avoids pythagorean triplets but *not* taxi-numbers) */
hash += cnt;
#endif
return hash;
}
/*************************/
struct list *list_new(unsigned cnt){
struct list *p;
unsigned idx;

p = malloc( sizeof *p + cnt);
p->data = (unsigned char*)(p+1);
p->cnt = cnt;
p->next = NULL;

for(idx=0; idx < cnt; idx++) {
        p->data[idx] = 0xff & rand();
        }
p->hash = hash_it(p->data, p->cnt);
return p;
}
/*************************/
void do_insert(struct list *this)
{
struct list **pp;
unsigned slot;

slot  = this->hash % COUNTOF(hashtab);
for (pp = &hashtab[slot]; *pp; pp = &(*pp)->next) {;}
*pp = this;
}
/*************************/
void list_print(struct list *this)
{
unsigned idx;
if (!this) return;

printf("%lx data[%u] = ", (unsigned long) this->hash, this->cnt);

for (idx=0; idx < this->cnt; idx++) {
        printf("%c%u"
        , idx ? ',' : '{' , (unsigned int) this->data[idx] );
        }
printf("}\n" );
}
/*************************/
unsigned list_cnt(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) { cnt++; }
return cnt;
}
/*************************/
unsigned list_cnt_collisions(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) {
        struct list *that;
        for(that=this->next; that; that=that->next) {
                if (that->cnt != this->cnt) continue;
                if (that->hash == this->hash) cnt++;
                }
        }
return cnt;
}
/*************************/
int main(void)
{
unsigned idx, val;
struct list *p;
unsigned hist[300] = {0,};

        /* NOTE: you need a better_than_default random generator
        ** , the zobrist array should **not** contain any duplicates
        */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] = val;
        }

        /* a second pass will increase the randomness ... just a bit ... */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] ^= val;
        }
        /* load-factor = 100 % */
for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        do {
          val = random();
          val %= 0x40;
        } while(val < 4); /* array size 4..63 */
        p = list_new(val);
        do_insert(p);
        }

for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        val = list_cnt( hashtab[idx]);
        hist[val] += 1;
        val = list_cnt_collisions(hashtab[idx]);
        if (!val) continue;
        printf("[%u] : %u\n", idx, val);
        for (val=0,p = hashtab[idx]; p; p= p->next) {
                printf("[%u]: ", val++);
                list_print(p);
                }
        }

for (idx = 0; idx < COUNTOF(hist); idx++) {
        if (!hist[idx]) continue;
        printf("[%u] = %u\n", idx, hist[idx]);
        }

return 0;
}
/*************************/

Output histogram (chain lengths, 0 := empty slot):

$ ./a.out
[0] = 24192
[1] = 23972
[2] = 12043
[3] = 4107
[4] = 1001
[5] = 181
[6] = 34
[7] = 4
[8] = 2

Final note: instead of sum of squares of Zobrist[] you could just as well XOR them together (assuming the entries are unique)

Extra final note: the C stdlib rand() function can be unusable. RAND_MAX could be only 15 bits: 0x7fff (32767). For populating the zobrist table you need more values. This can be done by XORing some extra (rand() << shift) into the higher bits.

New results, using (a sample from) a very large source domain (32 elements * 8bits), hashing it to 32bits hashkeys, inserting into a hashtable of 1<<20 slots.

Number of elements 1048576 number of slots 1048576
Element size = 8bits, Min setsize=0, max set size=32
(using Cubes, plus adding size) Histogram of chain lengths:
[0] = 386124 (0.36824)
[1] = 385263 (0.36742)
[2] = 192884 (0.18395)
[3] = 64340 (0.06136)
[4] = 16058 (0.01531)
[5] = 3245 (0.00309)
[6] = 575 (0.00055)
[7] = 78 (0.00007)
[8] = 9 (0.00001)

This is very close to optimal; for a 100% loaded hash table, the first two entries in the histogram should be equal, in the perfect case, both 1/e. The first two entries are empty slots and slots with exactly one element.

Upvotes: 1

atlaste

Reputation: 31106

This answer is just for completeness.

From the solution by @joop, I noticed that he uses less bits than me. Also, he also proposed using x^3 instead of x^2, which makes a huge difference.

In my code, I use 8 bit id's for the test, because of the resulting small key space. This means we can simply test all chains with a length up to 4 or 5 id's. The hash space is 32 bits. The (C#) code is very simple:

static void Main(string[] args)
{
    for (int index = 0; index < 256; ++index)
    {
        CreateHashChain(index, 4, 0);
    }

    // Create collision histogram:
    Dictionary<int, int> histogram = new Dictionary<int, int>();
    foreach (var item in collisions)
    {
        int val;
        histogram.TryGetValue(item.Value, out val);
        histogram[item.Value] = val + 1;
    }

    foreach (var item in histogram.OrderBy((a) => a.Key))
    {
        Console.WriteLine("{0}: {1}", item.Key, item.Value);
    }
    Console.ReadLine();
}

private static void CreateHashChain(int index, int size, uint code)
{
    uint current = (uint)index;

    // hash
    uint v = current * current;
    code = code ^ v;

    // recurse for the rest of the chain:
    if (size == 1)
    {
        int val;
        collisions.TryGetValue(code, out val);
        collisions[code] = val + 1;
    }
    else
    {
        for (int i = index + 1; i < 256 - size; ++i)
        {
            CreateHashChain(i, size - 1, code);
        }
    }
}

private static Dictionary<uint, int> collisions = new Dictionary<uint, int>();

Now, this is all about the hash function. I'll just write down some of the things I tried with the findings:

x^2

Code:

// hash
uint v = current * current;
code = code ^ v;

Results: lots and lots and lots of collisions. In fact, there isn't a case that doesn't collide less than 3612 times. Obviously we're using only 16 bits, so it can be explained just fine. Anyways, the result is Pretty Bad.

x^3

Code:

// hash
uint v = current * current * current;
code = code ^ v;

Results:

Still pretty bad, but again, we're only using 24 bits of the key space so collisions are bound to happen. Also, it's much better than using x^2.

x^4

Code:

// hash
uint v = current * current;
v = v * v;
code = code ^ v;

Results:

1: 118795055
2: 20402127
3: 2740658
4: 329621
5: 38453
6: 4420
7: 495
8: 47
9: 12

As expected, this is much better and obviously this is due to the fact that we're using the full 32-bits now.

Introducing y

Another way to introduce a larger key space is to introduce another variable -say- y which is a function of x. The idea behind this is that x^n for small values of x will result in small numbers, thereby having a high chance of collisions; we can offset this by ensuring that y will be a large number if x is small and doing bit arithmetic to combine two hash functions. The easiest way to do this is to cause bit flips for all bits:

// hash
uint x = current;
uint y = (255 ^ current);

uint v1 = (UInt16)(x * x * x);
uint v2 = (UInt16)(y * y * y);
code = code ^ v1 ^ (v2 << 16);

This will result in the following:

1: 154971022
2: 6827322
3: 235081
4: 7554
5: 263
6: 9
7: 1

Interestingly, this immediately gives much better results than all the previous approaches. It also immediately raises the question if the 16-bit cast makes any sense. After all, x^3 will result in a 24-bit space with large gaps for small values of x. Combining that with another 24-bit space that's shifted will make better use of the available 32 bits. Note that we should still shift by 16 (and not 8!) for the same reason.

1: 162671251
2: 3276751
3: 45277
4: 473
5: 5

Multiply by constant (final result)

Another way to blow up the key-space for y is to multiply-and-add. The code now becomes:

uint x = current;
uint y = (255 ^ current);
y = (y + 7577) * 0x85ebca6b;

uint v1 = (x * x * x);
uint v2 = (y * y * y);
code = code ^ v1 ^ (v2 << 8);

While this might not seem like an improvement, it has the advantage that we can easily scale 8 bit sequences to any arbitrary n bit sequence using this trick. I shift by 8 because I don't want the bits of v1 to interfere too much with the bits of v2. This gives the following result:

1: 162668435
2: 3277904
3: 45459
4: 464
5: 5

This is actually pretty good! We only have a 2% chance to collide, given all possible chains of 4 id's. Also, if we have larger chains, we can add more bits by using the same trick we performed with v2 (adding 8 bits for each additional hash code, so a 256 bit hash should be able to accomodate chains of about 29 8-bit id's).

Only question that remains is: how can we test this? As @joop pointed out in his program, the math is actually quite complex; random sampling might actually prove to be a solution for this for larger number of bits and larger chains.

Upvotes: 0

ruakh

Reputation: 183201

In my case, I want these things to generate the same hash value - but I also want to minimize the chance that -say- "abc" also generates the same hash value as -say- "abd".

Bitwise-XOR actually guarantees that: if two sets of the same size are the same except for one element, then they will necessarily have different bitwise-XORs. (Incidentally, the same is true of summation with wraparound: if two sets of the same size are the same except for one element, then they will necessarily have different sums-with-wraparound.)

So if you use bitwise-XOR for the bottom 32 bits, then you essentially have 32 "extra" bits to try to reduce collisions further: to reduce cases where two sets of different sizes have the same checksum, or cases where two sets that differ by two or more elements have the same checksum. A relatively simple approach is to select a function f that maps from 32-bit integers to 32-bit integers, and then apply bitwise-XOR to the result of applying f to each element. Major things you'd want from f:

it should be cheap and simple to implement.
it should map zero to a nonzero value (so that { 1, 2, 3 } and { 0, 1, 2, 3 } will have different checksums).
the mapping should not involve reorganizing bits in a constant way (e.g. bit-shifts), since reorganize_bits(a) XOR reorganize_bits(b) is equivalent to reorganize_bits(a XOR b), so it doesn't add any independent information to your checksum.
the mapping shouldn't involve XOR-ing with constants, for the same reason.

Above, joop suggests f(a) = a² MOD 2³², which seems decent to me, except for the issue with zero. Maybe f(a) = (a + 1)² MOD 2³²?

Upvotes: 0

Good hash function for unordered set of ID&#39;s

Answers (3)

Related Questions

Good hash function for unordered set of ID's