tcfritchman

Reputation: 77

Why is Guava Bloom Filter Performing so Poorly?

I am trying to determine whether Google Guava's Bloom filter will work for a project of mine; however, in my tests I'm getting an extremely high rate of false positives (presumably due to a high number of hash collisions?).

I'm running the experiment using 2 data files. The first contains 22 million unique numbers (Integers) which I put into the bloom filter. The second contains another set of entirely different numbers, also unique, which I use to test the Bloom Filter for false positives.

Here is an example of what some of these numbers look like:

1010061
904436
859990
854448
839175
754186
904491
233955
904491
876342
919575
603051
1012863
989713
323424

My code is the following:

private static void experiment() {

    // Load 22m unique IDs from file
    ArrayList<String> skus = loadSkus("sku_1.txt");
    int numInsertions = skus.size();

    // Google Guava Bloom Filter
    Funnel<String> strFunnel = (Funnel<String>) (from, into) -> into.putString(from, Charset.forName("US-ASCII"));
    BloomFilter<String> bf = BloomFilter.create(strFunnel, numInsertions, 0.001);

    for (String sku : skus) {
        bf.put(sku);
    }

    int falsePositiveCount = 0;
    double falsePositiveRate;

    // Load another set of unique IDs that are NOT in the first set
    ArrayList<String> skus2 = loadSkus("sku_2.txt");

    for (String sku : skus2) {
        if (bf.mightContain(sku)) {
            falsePositiveCount++;
        }
    }

    falsePositiveRate = (double)falsePositiveCount / (double)skus2.size();

    System.out.println("Expected FPP:     " + Double.toString(bf.expectedFpp()));
    System.out.println("Measured FP rate: " + Double.toString(falsePositiveRate));
}
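
(`loadSkus` isn't shown above; a minimal sketch of such a helper, assuming one SKU per line in the file, could be:)

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;

class SkuLoader {
    // Hypothetical sketch of the loadSkus helper used in the experiment:
    // reads one SKU per line, trims whitespace, and skips blank lines.
    static ArrayList<String> loadSkus(String path) {
        ArrayList<String> skus = new ArrayList<>();
        try {
            for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.US_ASCII)) {
                String sku = line.trim();
                if (!sku.isEmpty()) {
                    skus.add(sku);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return skus;
    }
}
```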

The result:

Expected FPP:     7.276343403395039E-27
Measured FP rate: 0.9979594547309587

The measured rate of false positives seems unbelievably high! This is not how this data structure should behave. Am I misusing the library in some way? I would really like to achieve proper performance with the Bloom Filter.

Upvotes: 2

Views: 1819

Answers (1)

depsypher

Reputation: 1219

I couldn't reproduce your results. The only thing I can think of is that it has something to do with your data files.

I used the same code you posted except I generated the skus like so:

final List<String> skus = ContiguousSet.create(Range.closedOpen(0, 22000000), DiscreteDomain.integers())
        .stream()
        .map(String::valueOf)
        .collect(Collectors.toList());

and

final List<String> skus2 = ContiguousSet.create(Range.closedOpen(-22000000, 0), DiscreteDomain.integers())
        .stream()
        .map(String::valueOf)
        .collect(Collectors.toList());

Results:

Expected FPP:     0.0010001451412535098
Measured FP rate: 9.963636363636364E-4
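
To see the same expected behavior with no library or data files involved at all, here's a toy Bloom filter using only the JDK. The sizing formulas are the standard ones (m = -n·ln p / (ln 2)², k = (m/n)·ln 2), but the double-hashing scheme is an illustrative assumption of mine, not Guava's implementation:

```java
import java.util.BitSet;

// Toy Bloom filter: m bits, k probe positions derived by double hashing.
class ToyBloom {
    private final BitSet bits;
    private final int m;
    private final int k;

    ToyBloom(int expectedInsertions, double fpp) {
        // Standard sizing: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2
        this.m = (int) Math.ceil(-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
        this.k = Math.max(1, (int) Math.round((double) m / expectedInsertions * Math.log(2)));
        this.bits = new BitSet(m);
    }

    // i-th probe index for s; h1 + i*h2 is a common double-hashing construction.
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = Integer.reverse(h1) ^ 0x9E3779B9; // arbitrary mix for a second hash
        return Math.floorMod(h1 + i * h2, m);
    }

    void put(String s) {
        for (int i = 0; i < k; i++) bits.set(index(s, i));
    }

    boolean mightContain(String s) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(s, i))) return false;
        }
        return true;
    }
}

class BloomDemo {
    public static void main(String[] args) {
        int n = 100_000;
        ToyBloom bf = new ToyBloom(n, 0.001);
        for (int i = 0; i < n; i++) bf.put(String.valueOf(i));  // insert 0..n-1
        int fp = 0;
        for (int i = n; i < 2 * n; i++) {                       // query a disjoint set
            if (bf.mightContain(String.valueOf(i))) fp++;
        }
        System.out.println("Measured FP rate: " + (double) fp / n);
    }
}
```

With the insert set and query set genuinely disjoint, the measured rate stays in the neighborhood of the configured target, which is the behavior your Guava run should also show.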

Upvotes: 5
