Reputation: 844
I have 50 EC2 instances all crawling the web. Right now they use Redis on the backend to track URLs that have already been crawled; however, ElastiCache is becoming cost prohibitive, and I keep running into the issue of having too many connections open. I've been looking at implementing a Bloom filter as a backend, but I don't understand how I can do this so that all 50 servers share the same Bloom filter. I don't want each one having its own independent Bloom filter, because then they'd all basically be repeating the same work.
Upvotes: 2
Views: 1561
Reputation: 509
You can still use Redis to keep track of the URLs that were already processed/crawled in a centralised way, but reduce the memory footprint by using the Bloom filter provided by RedisBloom (redisbloom.io). RedisBloom is a Redis module that extends Redis with several probabilistic data structures.
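As a minimal sketch of how all 50 crawlers could share one filter, here is what the client side might look like using redis-py's generic `execute_command` with the RedisBloom `BF.RESERVE`/`BF.ADD` commands (the host name, key name, error rate, and capacity below are all assumptions, not values from the question):

```python
import redis

# Assumed: a RedisBloom-enabled Redis server reachable by every crawler.
r = redis.Redis(host="redis.example.com", port=6379)

# Reserve the shared filter once, sized for the expected number of URLs.
# BF.RESERVE raises an error if the key already exists, so other crawlers
# simply skip creation.
try:
    r.execute_command("BF.RESERVE", "crawled:urls", 0.001, 100_000_000)
except redis.ResponseError:
    pass  # filter was already created by another instance

def seen_before(url: str) -> bool:
    """Mark url as crawled; return True if it was (probably) already seen."""
    # BF.ADD returns 1 if the item was newly added, 0 if it may already exist.
    return r.execute_command("BF.ADD", "crawled:urls", url) == 0
```

Because the filter lives in Redis rather than in each process, every instance sees the same membership state, and each check is a single round trip instead of a large in-memory structure replicated 50 times.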
Notes:
If a single Bloom filter would become too large, or the throughput too high for a single shard, you can instead keep several Bloom filters spread across a Redis Cluster and compute the appropriate filter (key) on the client side (see the sketch after these notes).
You might want to bump this issue, which requests support for items in the Bloom filter to expire over time, allowing you to revisit URLs after a given period.
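For the sharded variant in the first note, a minimal sketch of the client-side key selection (the shard count and helper names are hypothetical, and it assumes each shard's filter has already been reserved as above):

```python
import hashlib

NUM_FILTERS = 16  # assumed shard count; pick based on expected load

def filter_key(url: str) -> str:
    """Hash the URL to one of several filter keys so a Redis Cluster
    can distribute the filters across shards."""
    shard = int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_FILTERS
    return f"crawled:urls:{shard}"

def seen_before_sharded(r, url: str) -> bool:
    """Same check as before, but against the URL's assigned filter."""
    return r.execute_command("BF.ADD", filter_key(url), url) == 0
```

Since the key is derived deterministically from the URL, every crawler routes a given URL to the same filter, so the instances still share state without any one shard holding the whole set.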
Upvotes: 9