Abdullah Saleem
Abdullah Saleem

Reputation: 3811

Filter index and Hash functions in Bloom Filter

Bloom filter uses a bit array of m-bits hence there are 0 to m-1 indexes in the array but the hash functions I'm using return a 32-bit hash hence it could be from 0 to (2^32)-1 as the hash is used as an index for the bit array(filter) it is quite possible that the hash is greater than m as a result the value will not be mapped on the bit array. Should I take the mod of the hash i.e hash % m so that the resulting hash must correspond to an index in the bit array. Will it increase the number of false positives(IMO it will)?

Upvotes: 2

Views: 617

Answers (2)

Timothy Shields
Timothy Shields

Reputation: 79441

A hash function h: S -> uint is loosely defined to be one which exhibits a high degree of entropy over the set S. Suppose I have a certain hash function h that has very high entropy over S, but for which the output h(x) for x in S is always even. This restriction simply means that a bit of the output of h is being wasted, which is only 1 / 32 of the bits.

Now suppose I have a Bloom filter for which m is an even number. Then h(x) % m is always going to be an even number - which means only half of the bits of the Bloom filter will ever be used! That's bad!

As others have suggested, simply taking the first m bits of h(x) as an index into 2^m Bloom filter bins is a better strategy, because, assuming you're being given a hash function that exhibits a high degree of entropy over the set S, the "first m bits" hash function g(x) = h(x)[0:m-1] should exhibit a nearly proportional amount of entropy.

Upvotes: 1

Dawid
Dawid

Reputation: 643

Yes, using mod increases probability for false positives. Stephan T. Lavavej had a great talk about it on GoingNative 2013 (that mod creates bias), See HERE

His also mentioning about (what @btilly) said: it's better to simply cut bits - if yours hash function is good then it's OK.

Upvotes: 1

Related Questions