Combining Bloom Filters

Question

I am using bloom filters to check for duplicated data in a set. However, there is a need to combine the results of two sets of data into a single filter to check for duplication across the two sets. I devised a function in pseudo-Python to perform this task:

def combine(a : bloom_filter, b : bloom_filter):
    assert a.length == b.length
    assert a.hashes == b.hashes

    c = new bloom_filter(length = a.length, hashes = b.hashes)
    c.attempts = a.attempts + b.attempts
    c.bits = a.bits | b.bits

    # Determining the amount of items
    a_and_b = count(a & b)
    a_not_b = count(a & !b)
    not_a_b = count(!a & b)
    neither = count(!a & !b)
    c.item_count = a_not_b / a.length * a.item_count
                 + not_a_b / b.length * b.item_count
                 + a_and_b / c.length * min(a.item_count, b.item_count)

    return c

Does this even sound correct? I am having considerable internal debate as to whether is is even possible to do what I intend, since much of the information about the source data is lost (which is the point of a bloom filter).

Travis Gockel · Accepted Answer

You can derive a formula for estimating the amount of items a Bloom Filter:

c = log(z / N) / ((h * log(1 - 1 / N))

N: Number of bits in the bit vector
h: Number of hashes
z: Number of zero bits in the bit vector

This provides a fairly accurate estimate of the number of items in the Bloom Filter. You can come up with an estimate for contribution with simple subtraction.

Combining Bloom Filters

Answers (2)

Related Questions