Travis Gockel
Travis Gockel

Reputation: 27673

Combining Bloom Filters

I am using bloom filters to check for duplicated data in a set. However, there is a need to combine the results of two sets of data into a single filter to check for duplication across the two sets. I devised a function in pseudo-Python to perform this task:

def combine(a : bloom_filter, b : bloom_filter):
    assert a.length == b.length
    assert a.hashes == b.hashes

    c = new bloom_filter(length = a.length, hashes = b.hashes)
    c.attempts = a.attempts + b.attempts
    c.bits = a.bits | b.bits

    # Determining the amount of items
    a_and_b = count(a & b)
    a_not_b = count(a & !b)
    not_a_b = count(!a & b)
    neither = count(!a & !b)
    c.item_count = a_not_b / a.length * a.item_count
                 + not_a_b / b.length * b.item_count
                 + a_and_b / c.length * min(a.item_count, b.item_count)

    return c

Does this even sound correct? I am having considerable internal debate as to whether is is even possible to do what I intend, since much of the information about the source data is lost (which is the point of a bloom filter).

Upvotes: 4

Views: 3367

Answers (2)

Jeremy Adsitt
Jeremy Adsitt

Reputation: 434

It could be possible..... sort of..

lets say set A contains apples and oranges

lets say set B contains peas and carrots

construct a simple 16 bit bloom filter as an example and CRC32 as the hash

crc32(apples) = 0x70CCB02F

crc32(oranges) = 0x45CDF3B4

crc32(peas) = 0xB18D0C2B

crc32(carrots) = 0x676A9E28

Start w/ empty bloom filter (BF) (say 16 bits) for both sets (A, B)

BFA = BFB = 0000 0000 0000 0000 

then, breaking the hash into some bit length, we'll use 4 here we can add apples to the BF. e.g.

Get Apples BF Index list by splitting up the hash:

0x70CCB02F = 0111 0000 1100 1100 1011 0000 0010 1111
             7      0    C    C   B     0    2     F     
----------------------------------------------------

Add Apples to BFA by setting BF bit indexes [ 7, 0, 12, 12, 11, 0, 2, 15]

                                 (set the index bit of an empty BF to 1)
Apples =     1001 1000 1000 0101 (<- see indexes 0,2,7,11,12,15 are set)
BF =         0000 0000 0000 0000  (or operation adds that item to the BF)
================================
Updated BFA = 1001 1000 1000 0101 

Add Oranges to BF same way:

0x45CDF3B4 = 0100 0101 1100 1101 1111 0011 1011 0100
              4    5    12   13   15    3   11   4
----------------------------------------------------
Add oranges to BF by setting BF bit indexes [ 4,5,12,13,15,3,11,4]

Oranges =      1011 1000 0011 1000 
BFA =          1001 1000 1000 0101  (or operation)
================================
Updated BFA =  1011 1000 1011 1101 

So now apples and oranges are inserted into BF1 w/ Final Value of 1011 1000 1011 1101

Do the same for BFB

crc32(peas) = 0xB18D0C2B becomes => 
set [11,2,12,0,13,1,8] in BFB
 0011 1001 0000 0011 = BF(peas)

crc32(carrots) = 0x676A9E28 becomes => 
set [8,2,14,9,10,6,7] in BFB

0100 0111 1100 0100 = BF(carrots)

so BFB = 
0011 1001 0000 0011  BF(peas)
0100 0111 1100 0100  BF(carrots)
===================  ('add' them to BFB via locial or op)
0111 1111 1100 0111

you could now search B for A entries in a loop and vice verse:

Does B contain "oranges" =>

 1011 1000 0011 1000 (Oranges BF representation)
 0111 1111 1100 0111 (BFB)
=====================     (and operation)
 0011 1000 0000 0000  

Because this result (0011 1000 0000 0000) doesn't match the Original BF of Oranges, you can be certain that B doesn't contain any oranges

... ... (do for rest of items)

and following, B doesn't contain any of A items, just as B doesn't contain any of the apples.

I don't think that's what you asked though, and looks like you could computer a difference BF, which is more to your point. Seems like you could do a xor op and that would give you a 'single' array containing both differences:

0111 1111 1100 0111 (BFB)
1011 1000 1011 1101 (BFA)
========================
1100 0111 0111 1010 (BFA xor BFB) == (items in B not in A, and items in A not in B)

meaning with this single BF, you could detect the non-existance of an item 100% of the time, just not the existance of the item 100%.

The way you would use it, is as follows (check if peas is 'missing from A):

 1100 0111 0111 1010 (BFA xor BFB)
 0011 1001 0000 0011 (Peas)
============================== (And operation)
 0000 0001 0000 0010 (non-zero)

since (BFA xor BFB) && (Peas) != 0 you know one set does not contain 'peas'...

again, you'd be testing for item by item, maybe you could do aggregation but probably not a good idea...

Hope this helps!

Upvotes: 1

Travis Gockel
Travis Gockel

Reputation: 27673

You can derive a formula for estimating the amount of items a Bloom Filter:

c = log(z / N) / ((h * log(1 - 1 / N))

N: Number of bits in the bit vector
h: Number of hashes
z: Number of zero bits in the bit vector

This provides a fairly accurate estimate of the number of items in the Bloom Filter. You can come up with an estimate for contribution with simple subtraction.

Upvotes: 2

Related Questions