Franz Kafka

Reputation: 844

How to use a Bloom Filter across multiple servers?

I have 50 EC2 instances all crawling the web. Right now they are using Redis on the backend to track URLs that have already been crawled; however, ElastiCache is becoming cost prohibitive, and I keep running into the issue of having too many connections open. I've been looking at implementing a Bloom filter as a backend, but I don't understand how I can do this so that all 50 servers share the same Bloom filter. I don't want each one having its own independent Bloom filter; otherwise they'd all basically be doing the same work.

Upvotes: 2

Views: 1561

Answers (1)

Pieter Cailliau

Reputation: 509

You can still use Redis to keep track of the URLs that were already processed/crawled in a centralised way, but reduce the memory footprint by using the Bloom filter from RedisBloom (redisbloom.io). RedisBloom is a Redis module that extends Redis with several probabilistic data structures.
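A minimal sketch of what the crawlers could do, assuming redis-py and a Redis server with the RedisBloom module loaded; the key name, capacity, and error rate below are illustrative, not recommendations:

    # Sketch: shared Bloom filter via RedisBloom, using redis-py's
    # generic execute_command for the BF.* commands.
    import redis

    r = redis.Redis(host="my-redis-host", port=6379)  # hypothetical host

    # Create the filter once; ignore the error if another crawler
    # already reserved it.
    try:
        r.execute_command("BF.RESERVE", "crawled:urls", 0.001, 100_000_000)
    except redis.ResponseError:
        pass

    def should_crawl(url: str) -> bool:
        # BF.ADD returns 1 if the item was newly added and 0 if it may
        # already be present, so one round trip both checks and marks a URL.
        return r.execute_command("BF.ADD", "crawled:urls", url) == 1

    if should_crawl("https://example.com/page"):
        print("crawl it")
    else:
        print("probably seen before, skip")

Because all 50 crawlers talk to the same key, they share one filter instead of 50 independent ones, and the memory cost is that of the Bloom filter rather than one Redis entry per URL.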

Notes:

  • If a single Bloom filter becomes too large, or the throughput becomes too high for a single shard, you can split it into several Bloom filters spread across a Redis Cluster and compute the appropriate filter (key) on the client side (see the sketch after these notes).

  • You might want to bump this issue, which requests that items in the Bloom filter expire over time, allowing you to revisit URLs after a given period.
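A minimal sketch of that client-side sharding, assuming 16 filters named "crawled:urls:<n>" spread across a Redis Cluster; the shard count and key pattern are hypothetical:

    # Sketch: pick a Bloom filter deterministically from the URL, so every
    # crawler maps a given URL to the same filter (key).
    import zlib

    NUM_SHARDS = 16

    def filter_key(url: str) -> str:
        # CRC32 of the URL chooses the shard; any stable hash would do.
        shard = zlib.crc32(url.encode("utf-8")) % NUM_SHARDS
        return f"crawled:urls:{shard}"

    def should_crawl_sharded(r, url: str) -> bool:
        return r.execute_command("BF.ADD", filter_key(url), url) == 1

Since the key is derived purely from the URL, no coordination between crawlers is needed, and the cluster distributes the per-shard keys across its nodes as usual.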

Upvotes: 9
