Chathuranga Chandrasekara

Reputation: 20906

How does a crawler ensure a maximum coverage?

I read some articles on web crawling and learnt the basics. According to them, web crawlers just use the URLs retrieved from other web pages and traverse a tree (practically a mesh).

In that case, how does a crawler ensure maximum coverage? Obviously there may be a lot of sites that don't have referral links from other pages/sites. Do search engines use any other mechanisms besides crawling and manual registration (e.g. getting information from domain registries)?

If they are just based on crawling, how should we select a good set of "root" sites to begin crawling? (We have no way to predict the results. If we select 100 sites with no referral links, the engine will come up with just those 100 sites plus their inner pages.)

Upvotes: 4

Views: 724

Answers (3)

sharptooth

Reputation: 170499

There's no magic mechanism that would allow a crawler to find a site that is neither referred to by any already-crawled site nor manually added to the crawler.

The crawler only traverses the graph of links, starting from a set of manually registered - and therefore predefined - roots. Everything off that graph is unreachable to the crawler - it has no means of finding that content.
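To make that concrete, here is a minimal breadth-first crawl sketch in Python (assuming the third-party `requests` and `beautifulsoup4` packages and a placeholder seed URL); it only ever visits pages reachable by following links from the seeds:

```python
# Minimal breadth-first crawler sketch: only pages reachable from the
# seed list will ever be visited.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # URLs waiting to be fetched
    visited = set()           # URLs already fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue          # skip unreachable pages
        visited.add(url)

        # Extract outgoing links and queue the ones not seen yet.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in visited:
                frontier.append(link)

    return visited

# Anything not linked (directly or indirectly) from these seeds
# is invisible to this crawler.
pages = crawl(["https://example.com"])
```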

Upvotes: 1

Michael Borgwardt

Reputation: 346270

Obviously there may be a lot of sites that don't have referral links from other pages/sites.

I don't think this really is as big a problem as you think.

Do search engines use any other mechanisms besides crawling and manual registration (e.g. getting information from domain registries)?

None that I've heard of.

If they are just based on crawling, how should we select a good set of "root" sites to begin crawling?

Any kind of general-purpose web directory, like the Open Directory Project, would be an ideal candidate, as would social bookmarking sites like Digg or del.icio.us.
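As a rough sketch of that idea (the directory URL below is just a placeholder, and the same `requests`/`beautifulsoup4` assumption applies), you could scrape such a directory once to build the seed list for the crawler:

```python
# Build a seed list from the outgoing links on a directory page.
# The directory URL is a placeholder for illustration only.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def seeds_from_directory(directory_url):
    page = requests.get(directory_url, timeout=5)
    soup = BeautifulSoup(page.text, "html.parser")
    # Keep only absolute http(s) links as crawl roots.
    links = {urljoin(directory_url, a["href"])
             for a in soup.find_all("a", href=True)}
    return [link for link in links if link.startswith("http")]

# These become the "root" sites passed to the crawler.
roots = seeds_from_directory("https://example-directory.org/")
```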

Upvotes: 3

Andy White

Reputation: 88345

One method used to help crawlers is a "sitemap." A sitemap is basically a file that lists the pages of a website, so the crawler knows where to navigate, which is especially useful if your site has dynamic content. A complete, accurate sitemap greatly improves a crawler's coverage of the site.

Here's some info on the Google sitemap:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40318
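For illustration, here is a small sketch of how a crawler could pull URLs out of a standard XML sitemap and add them to its frontier (the sitemap URL is a placeholder; this assumes the `requests` package):

```python
# Parse a standard XML sitemap and return the listed page URLs so a
# crawler can reach them even if nothing links to them.
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_url):
    xml = requests.get(sitemap_url, timeout=5).text
    root = ET.fromstring(xml)
    # Each <url><loc>...</loc></url> entry names one crawlable page.
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

# Example (placeholder URL): extend the crawl frontier with sitemap entries.
extra_pages = urls_from_sitemap("https://example.com/sitemap.xml")
```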

Upvotes: 1
