Artem Nagornui
Artem Nagornui

Reputation: 21

Does Bloom Filter false positive rate increase on intersections/unions?

Didn't find anything on this, so I hope that my question will find the answer here.

Problem set:

Everything belongs to uplift mining with bloom filters.

I have thousands of bloom filters with some max capacity M, amount of items in each filter N.

For the case N won't ever reach M in any case at any stage.

Probability of false positives P - 0.001%

In my problem I need to incrementally perform from several to ±5 incremental intersections,

like A ∩ B ∩ C ∩ D...

such operation will be performed for arbitrary big number (or small depending on my cost function)of different set combinations of different lengs

A ∩ B; A ∩ J ∩ K; T ∩ W ∩ ... ∩ Z; etc.

All received (new) intersection as a bloomfilters (BF∩i), will be combined by union operation,

BF1 U BF2 U ... U BFi


Questions:

Will such operations on bloom filters affect the false positive rate for the ultimate, combined bloom filter (union of multiple intersections) as I might have many of them?

How can I estimate possible accuracy/precision loss (or rather false positive rate increase) for my case?

Will be very thankful for any hint or direction to a relevant material!

Upvotes: 0

Views: 1634

Answers (1)

Jim Mischel
Jim Mischel

Reputation: 134005

The discussion below assumes that all of the Bloom filters in question were created with the same parameters (capacity and hashes). If that's not the case, then your question is harder to answer.

The intersection of two Bloom filters, A and B, will result in a Bloom filter that will have, at most, the number of entries that were in the smaller of the two. That is, if A has fewer entries than B, then the result of A ∩ B cannot contain more items than A contains. Assuming the resulting Bloom filter is constructed with the same parameters as A (i.e. capacity and hashes), then the false positive rate in the result cannot be higher than it was in A or B, because the result cannot contain more items than the smaller of the two.

The union of two Bloom filters, again assuming that all were created with the same parameters, will always have a false positive rate that's at least as high as the Bloom filter with the highest false positive rate. That is, if B's FP rate is higher than that of A, then the FP rate of A U B will always be greater than or equal to the FP rate of B. The reason is that the resulting Bloom filter will always have at least as many items as the larger of the two.

It's important to understand that when you construct a Bloom filter to hold a given number of items, the targeted false positive rate is for when you have added that many items to the Bloom filter. For example, if you create a Bloom filter to hold 1,000,000 items with a FP rate of 0.0001, then after you've added 1,000,000 items to the Bloom filter, you can expect a false positive rate of 1 in 10,000. But if you only add 100,000 items to the Bloom filter, the actual false positive rate will be much lower.

As long as you don't exceed the designed capacity of the Bloom filter, the false positive rate won't exceed the designed value.

Upvotes: 5

Related Questions