Henley Wing Chiu
Henley Wing Chiu

Reputation: 22535

How can Bloom Filters help in determining if URL already crawled?

I keep hearing about how Bloom Filters can be useful in web crawling, especially in determining whether an URL has already been crawled(since a Bloom Filter is memory-efficient in testing for set membership).

However, in the use case of web crawling, wouldn't the number of bits/buckets need to be enormous, given that there is an almost infinite number of URLs encountered? Especially, if you're Google or a search engine trying to crawl data daily.

So my question is how does a Bloom Filter help in determining if an URL has already been crawled when the # of urls keep increasing, and the # of buckets remain constant?

Upvotes: 1

Views: 2121

Answers (2)

Rajeev Sampath
Rajeev Sampath

Reputation: 2757

For this use case, unless there are a huge number of buckets, you'll be stuck with a large percentage of false positives (it's almost impossible to totally eliminate false positives anyway, even for a small app).

One interesting workaround will be to have multiple levels of bloom filters rather than having a flat structure, for an example, first level is based on just the domain name (such as cnn.com), the next level may contain the extended urls (such as cnn.com/sports/athletics). But when string operations and multiple hash functions are concerned, not sure how well this will perform.

Upvotes: 2

Joe
Joe

Reputation: 28356

Bloom filters are based on hash functions, which produce a finite range of values. Regardless of how many URLs are encountered, each function will return one of the values in its range. Using multiple hash functions to select the bits reduces the likelihood of false positives, but that is always a possibility. However it has a small probability, and is a calculated trade-off between accuracy and efficiency.

There are practical limits to the length of a URL, see this question. Granted that is a staggering number. When more URLs are created, the hash functions and bucket sizes may need to be upgraded, but those available today are quite capable of coping with the URLs available today, with an acceptably small error rate.

Upvotes: 3

Related Questions