RJIGO

Reputation: 1943

Do web crawlers rely ONLY on links from homepage to do their crawling?

My homepage links to pages a.html and b.html. In the same directory as these two pages, I also have pages c.html and d.html, which are not linked to by any other page.

My question is: do web crawlers also index c.html and d.html just because they are in the same directory? Or do they only follow the links starting from the homepage and index only the homepage plus pages a and b? Thanks.

Upvotes: 0

Views: 619

Answers (2)

Kiril

Reputation: 40375

Web crawlers only know about links, so if nobody in the world has a link to pages c.html and d.html, then the likelihood that a crawler will find them is pretty close to 0.

Let's see how a crawler might find those:

  1. Your home page only points to a.html and b.html, but if those pages have links to c/d.html, then a crawler will eventually find them.
  2. If the above is not true, but you've given somebody links to c/d.html and they posted those links on some website online, then a crawler will eventually find them.
  3. If you have a sitemap, then a crawler might eventually find them.

This assumes that the crawler is "good" and it's crawling long enough to get to a page which contains links to your c/d.html pages.
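To make the link-following behaviour concrete, here is a minimal sketch of a breadth-first crawler. It is only an illustration, not how any real search engine's crawler works: the `SITE` dictionary, the `example.com` URLs, and the `crawl` and `LinkExtractor` names are all hypothetical, and the "site" lives in memory so no network access is needed. The seed list stands in for whatever URLs the crawler already knows about (the home page, a link posted elsewhere, or an entry in a sitemap).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> HTML body (no network access).
SITE = {
    "http://example.com/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": "<p>Page A</p>",
    "http://example.com/b.html": '<a href="/">Home</a>',
    "http://example.com/c.html": "<p>Page C (not linked from anywhere)</p>",
    "http://example.com/d.html": "<p>Page D (not linked from anywhere)</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, site=SITE):
    """Breadth-first crawl from the seed URLs; returns the pages fetched."""
    seen = set(seeds)      # URLs already queued, so each is visited once
    queue = deque(seeds)
    fetched = set()
    while queue:
        url = queue.popleft()
        html = site.get(url)
        if html is None:   # unknown URL: nothing to fetch
            continue
        fetched.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched


if __name__ == "__main__":
    # Seeded only with the home page: c.html and d.html are never reached.
    print(sorted(crawl(["http://example.com/"])))
    # Seeded with c.html as well (e.g. from a sitemap or an external link).
    print(sorted(crawl(["http://example.com/", "http://example.com/c.html"])))
```

In this sketch the crawler never lists the directory; it only ever sees URLs that appear in its seeds or in a link on a page it has already fetched, which is why c.html turns up only once something points to it.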

Upvotes: 2

Most web crawlers (Google's in particular) are proprietary programs, so you cannot know for sure how they work in detail.

Web crawlers are also incredibly complex in their details. Google's crawler (and indexer) is rumored to be a binary executable of more than 700 megabytes (at GCC summits, Google people have said that they compile a program of that size, and I am guessing it is their crawler).

In theory, crawlers do follow links. But you don't control them. For instance, some public mail archive (or even your Gmail account, for Google) may point to your c.html, even if your main web page doesn't point to it.

Upvotes: 2
