Reputation: 3811
Bloom filter uses a bit array of m-bits hence there are 0 to m-1 indexes in the array but the hash functions I'm using return a 32-bit hash hence it could be from 0 to (2^32)-1 as the hash is used as an index for the bit array(filter) it is quite possible that the hash is greater than m as a result the value will not be mapped on the bit array. Should I take the mod of the hash i.e hash % m so that the resulting hash must correspond to an index in the bit array. Will it increase the number of false positives(IMO it will)?
Upvotes: 2
Views: 617
Reputation: 79441
A hash function h: S -> uint
is loosely defined to be one which exhibits a high degree of entropy over the set S
. Suppose I have a certain hash function h
that has very high entropy over S
, but for which the output h(x)
for x
in S
is always even. This restriction simply means that a bit of the output of h
is being wasted, which is only 1 / 32
of the bits.
Now suppose I have a Bloom filter for which m
is an even number. Then h(x) % m
is always going to be an even number - which means only half of the bits of the Bloom filter will ever be used! That's bad!
As others have suggested, simply taking the first m
bits of h(x)
as an index into 2^m
Bloom filter bins is a better strategy, because, assuming you're being given a hash function that exhibits a high degree of entropy over the set S
, the "first m
bits" hash function g(x) = h(x)[0:m-1]
should exhibit a nearly proportional amount of entropy.
Upvotes: 1
Reputation: 643
Yes, using mod increases probability for false positives. Stephan T. Lavavej had a great talk about it on GoingNative 2013 (that mod creates bias), See HERE
His also mentioning about (what @btilly) said: it's better to simply cut bits - if yours hash function is good then it's OK.
Upvotes: 1