Aman Deep Gautam
Aman Deep Gautam

Reputation: 8747

Efficient implementation of a Bloom filter in C?

This question has been asked previously but there was no answer for it at that time so I decided to ask it again.

I need an efficient implementation of a Bloom filter in C (not C++). If there is no such thing available, I would not mind implementing one, if given some good reference so that it doesn't take too much of my time.

I want to use this data structure for inserts and tests in a ratio (1:20k), so primarily it is test-intensive. The data to be tested is 64 bit integers.

Upvotes: 14

Views: 15342

Answers (6)

Andrei Ciobanu
Andrei Ciobanu

Reputation: 12848

I am not sure how efficient it is, because I haven't done extensive benchmarks, but I did document the chain of thoughts on "How To Implement a Bloom Filter in C" in this article, with the associated GitHub repo.

The code is simplified and it works on char*, but it can be extended for void* with a little bit of tweaking.

The main approach would be to look for a Bit Vector (also known as a Bit Array) implementation that will represent the underlying memory structure for your Bloom Filter.

There are a few options here:

  • Using something that exists already (e.g.: xtrapbits.h
  • Writing it yourself. That's a good exercise for practising bitwise operations.

The second step is to find yourself some good hash functions. Usually, I recommend using MurmurHash, SpookyHash, or FNV.

After having those in place, the code is easier to write, write yourself a struct like this one:

typedef uint32_t (*hash32_func)(const void *data, size_t length);

typedef struct bloom_filter_s {
    bit_vect *vect; // <--- Implementation dependent
    hash32_func *hash_functions;
    size_t num_functions;
    size_t num_items;
} bloom_filter;


bloom_filter *bloom_filter_new(size_t size, size_t num_functions, ...);
bloom_filter *bloom_filter_new_default(size_t size);
void bloom_filter_free(bloom_filter *filter);
void bloom_filter_put(bloom_filter *filter, const void *data, size_t length);
void bloom_filter_put_str(bloom_filter *filter, const char *str);
bool bloom_filter_test(bloom_filter *filter, const void *data, size_t lentgth);
bool bloom_filter_test_str(bloom_filter *filter, const char *str);

And then start implementing the methods.

For example, adding an element to the Bloom filter should be as simple as:

void bloom_filter_put(bloom_filter *filter, const void *data, size_t length){
    for(unsigned i = 0; i < filter->num_functions; i++) {
        uint32_t cur_hash = filter->hash_functions[i](data, length);
        bit_vect_set1(filter->vect, cur_hash % filter->vect->size);
        filter->num_items++;
    }
}

Also, contrary to popular belief you don't need K hash functions, you only need two and you can generate the others from them. Source here.

Upvotes: 0

Thomas Mueller
Thomas Mueller

Reputation: 50127

Several Bloom filter implementations, and alternative algorithms, are available in this project: https://github.com/FastFilter/fastfilter_cpp

Covered are the regular Bloom filter, blocked Bloom filter, cuckoo filter, Golomb coded set, and others. It is C++, but the main algorithms are easy to port to C.

Upvotes: 1

eqzx
eqzx

Reputation: 5599

Chromium has one in C++

github link

Upvotes: 2

unwind
unwind

Reputation: 400019

Not to do too much self-promotion, but I've written a plugin for the Geany editor/IDE that filters out duplicate text lines, it uses a Bloom filter.

The implementation is in C, and you can find it right here on GitHub. It's GPL v3, so depending on your exact needs you may or may not be able to use it.

Some notes about my implementation:

  • It's designed to filter strings, and doesn't abstract the key type. This means you're going to have to modify the key handling to suit your needs.
  • It supports un-characteristic semantics, you can actually use it for totally non-probabilistic existance-testing if you want to (see the BloomContains callback function pointer used by bloom_filter_new()). Just pass NULL to get a "pure" filter.
  • The string hash function is MurmurHash2 by Austin Appleby. I evaluated the more current MurmurHash3, but version 2 was easier to work with.
  • To fit in the Geany eco system, this code uses GLib types throughout.

It hasn't been heavily tuned for performance, but should be okay. I would appreciate any feedback you might have after testing it, of course!

Upvotes: 4

Optimized Coder
Optimized Coder

Reputation: 189

I know this is an old question, but for reference, here are some github search results.

A simple search on github for 'bloomfilter' yields a ton of results (for C) :

https://github.com/search?q=extension%3Ac+bloomfilter&type=Code&ref=searchresults

This is for REFERENCE only.

Upvotes: 0

Jyri J. Virkki
Jyri J. Virkki

Reputation: 189

I have a stand-alone plain C library here which may be of use: https://github.com/jvirkki/libbloom

Upvotes: 17

Related Questions