Reputation: 4478
I have concatenated two ISO files into one file. Both ISO files are Linux distros from the same vendor, but of different versions. In the program I have written (shown below), the concatenated file is read in blocks of 512 bytes and the MD5 sum of each block is computed. The MD5 sum is stored in a HashSet<String>. If a block whose signature is already present in the HashSet is found, this is recorded.
The exact same algorithm is also run with a BloomFilter consulted before the actual lookup in the HashSet. Since a BloomFilter only provides guarantees on "non-containment" and can produce false positives on containment, I still look up the HashSet when the BloomFilter reports that a key might already be present.
The concatenated file is larger than 1 GB, so the number of 512-byte block signatures exceeds 1.77 million. The approach using the BloomFilter consistently takes about six times longer than the first approach.
Are there any reasons why this might be the case? Have I done something wrong here?
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.PrimitiveSink;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.concurrent.TimeUnit;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.lang3.time.StopWatch;
public class SimpleDedupTrial {
    public static void main(String[] args) throws IOException {
        int blockSize = 512;
        HashSet<String> signatureSet = new HashSet<>();
        File f = new File(
                "D:\\keshav\\per\\projects\\work-area\\dedup-temp\\merged-iso"
        );
        FileInputStream fis = new FileInputStream(f);
        long length = f.length();
        long sizeSaved = 0L;
        StopWatch sw = new StopWatch();
        int len;
        byte[] buffer = new byte[blockSize];
        while ((len = fis.read(buffer)) != -1) {
            // MD5 signature of the current 512-byte block
            String md5Hex = DigestUtils.md5Hex(buffer);
            if (sw.isStopped()) {
                sw.start();
            }
            if (sw.isSuspended()) {
                sw.resume();
            }
            if (signatureSet.contains(md5Hex)) {
                sizeSaved += len;
            } else {
                signatureSet.add(md5Hex);
            }
            sw.suspend();
        }
        sw.stop();
        fis.close();
        System.out.println("Time: " + sw.getTime(TimeUnit.MILLISECONDS));
        System.out.println("File size in MB: " + convertToMB(length));
        System.out.println("Size saved in MB: " + convertToMB(sizeSaved));
        System.out.println("Signature set size: " + signatureSet.size());
        System.out.println("Duplicate ratio: " + ((double) sizeSaved * 100 / length));
        System.out.println("With Bloom:");
        useBloomFilter();
    }

    private static long convertToMB(long sizeInBytes) {
        return sizeInBytes / (1024 * 1024);
    }

    private static void useBloomFilter() throws IOException {
        int blockSize = 512;
        Funnel<String> strFunnel = (String t, PrimitiveSink ps) -> {
            ps.putString(t, Charsets.US_ASCII);
        };
        HashSet<String> signatureSet = new HashSet<>();
        File f = new File(
                "D:\\keshav\\per\\projects\\work-area\\dedup-temp\\merged-iso"
        );
        FileInputStream fis = new FileInputStream(f);
        long length = f.length();
        long sizeSaved = 0L;
        BloomFilter<String> signatureBloomFilter = BloomFilter.create(
                strFunnel, (length / blockSize)
        );
        StopWatch sw = new StopWatch();
        int len;
        byte[] buffer = new byte[blockSize];
        while ((len = fis.read(buffer)) != -1) {
            String md5Hex = DigestUtils.md5Hex(buffer);
            if (sw.isStopped()) {
                sw.start();
            }
            if (sw.isSuspended()) {
                sw.resume();
            }
            if (signatureBloomFilter.mightContain(md5Hex)) {
                // Bloom filter says "maybe": confirm against the exact HashSet
                if (!signatureSet.contains(md5Hex)) {
                    signatureBloomFilter.put(md5Hex);
                    signatureSet.add(md5Hex);
                } else {
                    sizeSaved += len;
                }
            } else {
                // Bloom filter says "definitely not seen before"
                signatureBloomFilter.put(md5Hex);
                signatureSet.add(md5Hex);
            }
            sw.suspend();
        }
        sw.stop();
        fis.close();
        System.out.println("Time: " + sw.getTime(TimeUnit.MILLISECONDS));
        System.out.println("File size in MB: " + convertToMB(length));
        System.out.println("Size saved in MB: " + convertToMB(sizeSaved));
        System.out.println("Signature set size: " + signatureSet.size());
        System.out.println("Duplicate ratio: " + ((double) sizeSaved * 100 / length));
    }
}
Sample output:
Time: 819
File size in MB: 1071
Size saved in MB: 205
Signature set size: 1774107
Duplicate ratio: 19.183032558071734
With Bloom:
Time: 4539
File size in MB: 1071
Size saved in MB: 205
Signature set size: 1774107
Duplicate ratio: 19.183032558071734
Upvotes: 3
Views: 220
Reputation: 15852
It looks like you have missed the point of a Bloom filter a bit. We use one when we can't afford the memory for an exact structure and agree to lose some accuracy. For example, we might accept sending two push notifications to (or sending none to) 1 in 100 users in order to avoid storing the full collection of users who have already received the notification.
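To make that concrete for your program, here is a minimal sketch (my own illustration, not your code; the class name BloomOnlyDedupSketch and the 1% false-positive rate are assumptions) of a Bloom-filter-only pass that drops the HashSet entirely. Memory stays small and fixed, and in exchange the reported savings may be over-counted by roughly the false-positive rate.
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
public class BloomOnlyDedupSketch {
    public static void main(String[] args) throws IOException {
        File f = new File("merged-iso"); // substitute your own path
        int blockSize = 512;
        long expectedBlocks = f.length() / blockSize;
        // 1% false-positive probability: duplicates may be over-counted by ~1%,
        // but the filter stays small compared to a HashSet of hex strings.
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(Charsets.US_ASCII), expectedBlocks, 0.01);
        long sizeSaved = 0;
        try (FileInputStream fis = new FileInputStream(f)) {
            byte[] buffer = new byte[blockSize];
            int len;
            while ((len = fis.read(buffer)) != -1) {
                String md5Hex = DigestUtils.md5Hex(buffer);
                if (seen.mightContain(md5Hex)) {
                    sizeSaved += len; // possibly a false positive: accuracy traded for memory
                } else {
                    seen.put(md5Hex);
                }
            }
        }
        System.out.println("Approximate size saved in bytes: " + sizeSaved);
    }
}
This is the kind of setup where a Bloom filter actually pays off; in your program it sits on top of an exact HashSet, so it can only add overhead.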
With the HashSet you have an expected access time of O(1), so the Bloom filter can't speed up the process, and as you can see it slows it down. On the other hand, it uses very little memory, which simply isn't significant enough to show up in your statistics.
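As a rough back-of-the-envelope illustration of that memory difference (assuming the ~1.77 million signatures from your output, Guava's default 3% false-positive probability, and a guessed ~150 bytes of overhead per HashSet entry):
public class BloomSizeEstimate {
    public static void main(String[] args) {
        long n = 1_774_107L;   // signature count from the question's output
        double p = 0.03;       // Guava's default false-positive probability
        // Optimal bit count for a Bloom filter: m = -n * ln(p) / (ln 2)^2
        double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        System.out.printf("Bloom filter: ~%.1f MB%n", bits / 8 / (1024 * 1024));
        // Very rough per-entry cost of HashSet<String> holding 32-char MD5 hex
        // strings (String object, char data, HashMap.Node): assume ~150 bytes.
        System.out.printf("HashSet<String>: ~%.0f MB%n", n * 150.0 / (1024 * 1024));
    }
}
So the filter needs on the order of a couple of megabytes, while the exact set of hex signatures costs a few hundred.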
The slowdown comes from the fact that answering "not in" takes approximately the same time as the plain HashSet lookup, while answering "might be in" takes longer, because the HashSet still has to be checked afterwards.
You can read more here.
Upvotes: 2