Reputation: 10775
I am trying to understand how universal hashing works. It is defined h(x) = [(a*x + b) mod p] mod m
where a,b
- random numbers, m
- size of hash table, x
- key, and p
- prime number. For example, I have several different keys:
92333
23347
20313
And in order to create a universal hash function I have to to following:
Let a = 10, b = 22, p = 313, m = 100
h(92333) = [(10 * 92333 + 22) mod 313] mod 100 = 2 mod 100 = 2
h(23347) = [(10 * 23347 + 22) mod 313] mod 100 = 307 mod 100 = 7
...
But probably every time I will get number in range from 0 to 99 and there might be lots of collisions.
So my question is: I understood and applied universal hashing correctly?
Upvotes: 2
Views: 504
Reputation: 340168
Assuming that the numbers you're hashing have a uniform distribution, your function is biased toward buckets 0 through 12.
Assume that the hash operation up to and including the mod 313
operation occurs. The result of that operation gets you a value in the range 0..312. Again, if the result of this operation is even distributed, then take the mod 100
you get the following effect:
result of Occurs for these
mod 100 mod 313 results
----------- ------------------
0 0, 100, 200, 300
1 1, 101, 201, 301
2 2, 102, 202, 302
3 3, 103, 203, 303
4 4, 104, 204, 304
5 5, 105, 205, 305
6 6, 106, 206, 306
7 7, 107, 207, 307
8 8, 108, 208, 308
9 9, 109, 209, 309
10 10, 110, 210, 310
11 11, 111, 211, 311
12 12, 112, 212, 312
13 13, 113, 213
14 14, 114, 214
15 15, 115, 215
Notice how the number of opportunities to get a particular result drop after 12? There's your bias. Here's more evidence of that effect taken from counting the results of hashing the numbers 0 through 5,000,000:
counts[0]: 63898
counts[1]: 63896
counts[2]: 63899
counts[3]: 63900
counts[4]: 63896
counts[5]: 63896
counts[6]: 63900
counts[7]: 63896
counts[8]: 63896
counts[9]: 63900
counts[10]: 63898
counts[11]: 63896
counts[12]: 63899
counts[13]: 47925
counts[14]: 47922
counts[15]: 47922
counts[16]: 47925
... elided similar counts ...
counts[97]: 47922
counts[98]: 47922
counts[99]: 47925
Upvotes: 1
Reputation: 262464
But probably every time I will get number in range from 0 to 99 and there might be lots of collisions.
Right. But your hash table only has 100 buckets, so you cannot avoid collisions if you try to stick in more than a few dozen keys.
The best you can hope for is to spread out the collisions evenly across the whole one hundred buckets, which your hash function should be able to do roughly. That way, you won't run into a lot of collisions until the table fills up, and the collisions won't have too many parties involved each.
If you want to store a lot more keys, you need to make the table bigger.
Upvotes: 1