Reputation: 77
I am trying to determine whether Google Guava's Bloom filter will work for a project of mine, but in my tests I'm getting an extremely high rate of false positives (presumably due to a high level of hash collisions?).
I'm running the experiment using two data files. The first contains 22 million unique numbers (integers), which I put into the Bloom filter. The second contains an entirely disjoint set of numbers, also unique, which I use to test the Bloom filter for false positives.
This is an example of what some of these numbers look like:
1010061
904436
859990
854448
839175
754186
904491
233955
904491
876342
919575
603051
1012863
989713
323424
My code is the following:
private static void experiment() {
    // Load 22M unique IDs from file
    ArrayList<String> skus = loadSkus("sku_1.txt");
    int numInsertions = skus.size();

    // Google Guava Bloom filter
    Funnel<String> strFunnel =
            (Funnel<String>) (from, into) -> into.putString(from, Charset.forName("US-ASCII"));
    BloomFilter<String> bf = BloomFilter.create(strFunnel, numInsertions, 0.001);
    for (String sku : skus) {
        bf.put(sku);
    }

    int falsePositiveCount = 0;
    double falsePositiveRate;

    // Load another set of unique IDs that are NOT in the first set
    ArrayList<String> skus2 = loadSkus("sku_2.txt");
    for (String sku : skus2) {
        if (bf.mightContain(sku)) {
            falsePositiveCount++;
        }
    }

    falsePositiveRate = (double) falsePositiveCount / (double) skus2.size();
    System.out.println("Expected FPP: " + Double.toString(bf.expectedFpp()));
    System.out.println("Measured FP rate: " + Double.toString(falsePositiveRate));
}
The result:
Expected FPP: 7.276343403395039E-27
Measured FP rate: 0.9979594547309587
The measured false-positive rate seems unbelievably high; this is not how this data structure should behave. Am I misusing the library in some way? I would really like to get proper performance out of the Bloom filter.
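For reference, here is my understanding of how a Bloom filter is supposed to work, as a minimal stdlib-only sketch (an illustrative toy of my own, not Guava's implementation — Guava uses 128-bit murmur3 hashing internally, and the class name and hash mixing below are made up):

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (illustrative only, not Guava's implementation).
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int expectedInsertions, double fpp) {
        // Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
        this.numBits = Math.max(64,
                (int) (-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2))));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedInsertions * Math.log(2)));
        this.bits = new BitSet(numBits);
    }

    // Murmur-style finalizer to spread the bits of String.hashCode.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Double hashing: the i-th probe index is h1 + i*h2 (mod numBits).
    private int index(String s, int i) {
        int h1 = mix(s.hashCode());
        int h2 = mix(h1) | 1; // force odd so probes spread over the table
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void put(String s) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(s, i));
        }
    }

    boolean mightContain(String s) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(s, i))) {
                return false; // definitely absent
            }
        }
        return true; // present, or a false positive
    }
}
```

With correct sizing like this, probing a disjoint set of keys should give a measured false-positive rate near the requested fpp, which is why a rate near 1.0 looks so wrong to me.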
Upvotes: 2
Views: 1819
Reputation: 1219
I couldn't reproduce your results. The only thing I can think of is that it has something to do with your data files.
I used the same code you posted, except I generated the skus like so:
final List<String> skus =
        ContiguousSet.create(Range.closedOpen(0, 22000000), DiscreteDomain.integers())
                .stream()
                .map(String::valueOf)
                .collect(Collectors.toList());
and
final List<String> skus2 =
        ContiguousSet.create(Range.closedOpen(-22000000, 0), DiscreteDomain.integers())
                .stream()
                .map(String::valueOf)
                .collect(Collectors.toList());
Results:
Expected FPP: 0.0010001451412535098
Measured FP rate: 9.963636363636364E-4
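As a sanity check on these numbers, the standard Bloom filter sizing formulas applied to your parameters (22M insertions, 0.001 requested FPP) predict almost exactly the rate I measured. This is a back-of-the-envelope sketch using the textbook formulas, not Guava's exact code:

```java
// Back-of-the-envelope expected-FPP check for n insertions at target fpp p,
// using the standard Bloom filter formulas (not Guava's exact internals).
class FppEstimate {
    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalNumBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    static int optimalNumHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Expected FPP after n insertions: p = (1 - e^(-k*n/m))^k
    static double expectedFpp(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 22_000_000L;
        long m = optimalNumBits(n, 0.001); // ~316 million bits (~40 MB)
        int k = optimalNumHashes(n, m);    // ~10 hash functions
        System.out.println(expectedFpp(n, m, k)); // lands very close to 0.001
    }
}
```

So at 22M insertions the filter itself has plenty of headroom, which again points at the input data rather than the library.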
Upvotes: 5