Bob
Bob

Reputation: 10775

Universal hashing misunderstanding

I am trying to understand how universal hashing works. It is defined h(x) = [(a*x + b) mod p] mod m where a,b - random numbers, m - size of hash table, x - key, and p - prime number. For example, I have several different keys:

92333
23347
20313

And in order to create a universal hash function I have to to following:

Let a = 10, b = 22, p = 313, m = 100
h(92333) = [(10 * 92333 + 22) mod 313] mod 100 = 2 mod 100 = 2
h(23347) = [(10 * 23347 + 22) mod 313] mod 100 = 307 mod 100 = 7
...

But probably every time I will get number in range from 0 to 99 and there might be lots of collisions.

So my question is: I understood and applied universal hashing correctly?

Upvotes: 2

Views: 504

Answers (2)

Michael Burr
Michael Burr

Reputation: 340168

Assuming that the numbers you're hashing have a uniform distribution, your function is biased toward buckets 0 through 12.

Assume that the hash operation up to and including the mod 313 operation occurs. The result of that operation gets you a value in the range 0..312. Again, if the result of this operation is even distributed, then take the mod 100 you get the following effect:

 result of       Occurs for these
  mod 100        mod 313 results
-----------     ------------------
     0           0, 100, 200, 300
     1           1, 101, 201, 301
     2           2, 102, 202, 302
     3           3, 103, 203, 303
     4           4, 104, 204, 304
     5           5, 105, 205, 305
     6           6, 106, 206, 306
     7           7, 107, 207, 307
     8           8, 108, 208, 308
     9           9, 109, 209, 309
    10          10, 110, 210, 310
    11          11, 111, 211, 311
    12          12, 112, 212, 312
    13          13, 113, 213
    14          14, 114, 214
    15          15, 115, 215

Notice how the number of opportunities to get a particular result drop after 12? There's your bias. Here's more evidence of that effect taken from counting the results of hashing the numbers 0 through 5,000,000:

counts[0]: 63898
counts[1]: 63896
counts[2]: 63899
counts[3]: 63900
counts[4]: 63896
counts[5]: 63896
counts[6]: 63900
counts[7]: 63896
counts[8]: 63896
counts[9]: 63900
counts[10]: 63898
counts[11]: 63896
counts[12]: 63899
counts[13]: 47925
counts[14]: 47922
counts[15]: 47922
counts[16]: 47925

... elided similar counts ...

counts[97]: 47922
counts[98]: 47922
counts[99]: 47925

Upvotes: 1

Thilo
Thilo

Reputation: 262464

But probably every time I will get number in range from 0 to 99 and there might be lots of collisions.

Right. But your hash table only has 100 buckets, so you cannot avoid collisions if you try to stick in more than a few dozen keys.

The best you can hope for is to spread out the collisions evenly across the whole one hundred buckets, which your hash function should be able to do roughly. That way, you won't run into a lot of collisions until the table fills up, and the collisions won't have too many parties involved each.

If you want to store a lot more keys, you need to make the table bigger.

Upvotes: 1

Related Questions