Reputation: 15842
I got fancy about Bloom filters
so I started reading publications about them. There is one thing, I can't understand. How can we compress Bloom filter
as it is a random 0-1
vector?
Upvotes: 0
Views: 2104
Reputation: 133995
The paper Compressed Bloom Filters (pdf) explains the general idea. On page 3 of that document, they say:
Suppose, however, we instead choose k so that each of the entries in the m bit array is 1 with probability 1/3. Then we can take advantage of this fact to compress the m bit array and reduce the transmission size.
So rather than having a vector designed so that the probability of a bit being set is 1/2, which would create the "random vector" that doesn't compress well, they fiddle with the number of hash functions to affect the probabilities. The resulting array is approximately one-third 1's, and two-thirds 0's, which should prove more compressible.
Upvotes: 3
Reputation: 11968
You don't need to compress the bloom filter.
Not all your keys have a bit to represent them. They are represented by a number of bits that are reused for other keys. That's why you get false positives. When you add keys a,b and c you set a number of bits to 1. For the next key d it can be the case that all the bits that represent it are already set to 1 so you don't need to do anything (and you would get a false positive if you check if it was inserted after inserting a, b and c).
You can set the bloom filter size to whatever you want. If you make it bigger you use more space, but you decrease false positives. If you make it smaller you also increase false positives.
If you really need to make a bloom filter smaller, set it's size to what you can and then check the false positive rate. You can do this by picking a set of different keys, check if the bloom filter says they are already inserted and insert them (in some random order). Make sure the number of keys is representative of your actual use-case.
You can pass it through some compression algorithms but as you said, it's a random 0-1 vector so don't expect to gain a lot.
In general bloom filters are used as a quick check for existence before you make some expensive lookups/reads. You need it in memory to be fast (if you don't care about speed you would just do the lookups) and you need it uncompressed. If it's small enough to hold in memory there's usually no point in compressing it.
Upvotes: 1