Johann du Toit

Reputation: 2667

Crawling for Eternity

I've recently been building a new web app dealing with Recurring Events. These events can recur on a daily, weekly or monthly basis.

This is all working great. But when I started creating the Event Browser page (which will be visible to the public internet), a thought crossed my mind.

If a crawler hits this page, with next and previous buttons to browse the dates, won't it just continue forever? So I opted out of using plain HTML links and used AJAX instead, which means bots won't be able to follow the links.

But this method means I'm losing that functionality for users without JavaScript. Or is the number of users without JavaScript too small to worry about?

Is there a better way to handle this?

I'm also very interested in how bots like the Google crawler detect black holes like these and what they do to handle them.

Upvotes: 4

Views: 181

Answers (2)

John Williams

Reputation: 113

Even a minimally functional web crawler requires a lot more sophistication than you might imagine, and the situation you describe is not a problem. Crawlers operate on some variant of breadth-first search, so even if they do nothing to detect black holes, it's not a big deal. Another typical feature of web crawlers that helps is that they avoid fetching a lot of pages from the same domain in a short time span, because otherwise they would inadvertently be performing a DoS attack against any site with less bandwidth than the crawler.
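
For illustration, here's a rough Python sketch of that breadth-first, per-domain-throttled behaviour. The page budget, delay value and regex-based link extraction are all stand-ins, not how a production crawler is actually built:

    import re
    import time
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin, urlparse

    def fetch_links(url):
        """Toy link extractor: fetch the page and pull out href values."""
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            return []
        return [urljoin(url, href) for href in re.findall(r'href="([^"#]+)"', html)]

    def crawl(seed, max_pages=100, per_domain_delay=10.0):
        queue = deque([seed])    # FIFO queue gives the breadth-first order
        seen = {seed}
        last_fetch = {}          # domain -> timestamp of the last request
        fetched = 0

        while queue and fetched < max_pages:
            url = queue.popleft()
            domain = urlparse(url).netloc

            # Politeness: if this domain was hit too recently, push the URL
            # back onto the queue and wait a little, so no single site gets
            # flooded with requests.
            if time.time() - last_fetch.get(domain, 0.0) < per_domain_delay:
                queue.append(url)
                time.sleep(0.1)
                continue

            last_fetch[domain] = time.time()
            fetched += 1

            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen

Because the frontier is breadth-first, an endless next/previous chain only contributes one new URL per level, so it can never crowd out the rest of the queue.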

Even though it's not strictly necessary for a crawler to detect black holes, a good one might have all sorts of heuristics to avoid wasting time on low-value pages. For instance, it may choose to ignore pages that don't have a minimum amount of English (or whatever language) text, pages that contain nothing but links, pages that seem to contain binary data, etc. The heuristics don't have to be perfect because the basic breadth-first nature of the search ensures that no single site can waste too much of the crawler's time, and the sheer size of the web means that even if it misses some "good" pages, there are always plenty of other good pages to be found. (Of course this is from the perspective of the web crawler; if you own the pages being skipped, it might be more of a problem for you, but companies like Google that run web crawlers are intentionally secretive about the exact details of things like that because they don't want people trying to outguess their heuristics.)
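
As a toy illustration of what heuristics like these might look like (the thresholds here are invented, not anything any search engine publishes):

    import re

    def looks_low_value(html):
        """Toy versions of the heuristics above; the thresholds are made up."""
        if "\x00" in html:                        # smells like binary data, not HTML
            return True
        text = re.sub(r"<[^>]+>", " ", html)      # crude tag stripping
        words = re.findall(r"[A-Za-z]{2,}", text)
        links = len(re.findall(r"<a\s", html, re.IGNORECASE))
        if len(words) < 50:                       # almost no running text
            return True
        if links and len(words) / links < 5:      # nothing much but links
            return True
        return False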

Upvotes: 2

tripleee

Reputation: 189387

Add a nofollow directive to the page, or rel="nofollow" to the individual links you don't want crawled. You can do this in the page source (a robots meta tag or a link attribute), or block the URLs outright in robots.txt. See the Robots Exclusion Standard.
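
For example (the /events/browse/ path and link URL below are just placeholders for whatever your event browser uses):

    # robots.txt: keep compliant crawlers out of the date-browsing URLs
    User-agent: *
    Disallow: /events/browse/

    <!-- page source: a robots meta tag covering the whole page -->
    <meta name="robots" content="noindex, nofollow">

    <!-- or mark only the pagination links -->
    <a href="/events/browse/2012-06" rel="nofollow">Next month</a>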

You may still need to think about how to fend off ill-behaved bots which do not respect the standard.

Upvotes: 4
