hultqvist

Reputation: 18471

Spider interval for robots.txt

I have been reading up on web crawling and got a list full of considerations, however there is one concern that I have not found any discussion about yet.

How often should robots.txt be fetched for any given site?

My scenario is, for any specific site, a very slow crawl of maybe 100 pages a day. Let's say a website adds a new section (/humans-only/) which other pages link to, and at the same time adds the appropriate line to robots.txt. A spider might find links to this section before it re-fetches robots.txt.

Funny how writing down a problem suggests the solution: while formulating my question above, I got an idea for one.

The robots.txt can be updated rarely, like once a day, but all newly found links should be placed on hold in a queue until the next update of robots.txt. After robots.txt has been updated, all pending links that pass can be crawled.
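To make this concrete, here is a minimal sketch of the hold-and-release queue described above, assuming a single-site crawler in Python with the standard urllib.robotparser; the site URL, user-agent string, TTL, and the frontier/pending lists are illustrative placeholders, not a finished implementation:

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_URL = "http://example.com/robots.txt"   # hypothetical site
ROBOTS_TTL = 24 * 60 * 60                      # refresh robots.txt once a day
USER_AGENT = "MyCrawler"                       # hypothetical bot name

robots = RobotFileParser(ROBOTS_URL)
robots.read()
last_robots_fetch = time.time()

pending = []   # newly discovered links wait here until the next robots.txt refresh
frontier = []  # links cleared for crawling

def enqueue(url):
    """Newly found links are held until robots.txt has been refreshed."""
    pending.append(url)

def refresh_robots_and_release():
    """Once a day: re-read robots.txt, then move allowed pending links to the frontier."""
    global last_robots_fetch, pending
    if time.time() - last_robots_fetch < ROBOTS_TTL:
        return
    robots.read()
    last_robots_fetch = time.time()
    frontier.extend(u for u in pending if robots.can_fetch(USER_AGENT, u))
    pending = []
```

refresh_robots_and_release() would be called from the crawl loop before pulling URLs off the frontier, so held links are only released after the daily robots.txt refresh.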

Any other ideas or practical experience with this?

Upvotes: 3

Views: 1402

Answers (1)

Jim Mischel

Reputation: 134045

All large-scale Web crawlers cache robots.txt for some period of time. One day is pretty common, and in the past I've seen times as long as a week. Our crawler has a maximum cache time of 24 hours. In practice, it's typically less than that except for sites that we crawl very often.
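A straightforward way to enforce such a cap is a per-host cache with a 24-hour expiry. A rough sketch, assuming Python's standard urllib.robotparser; the host handling, user-agent string, and TTL are illustrative rather than a description of any particular crawler:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

MAX_CACHE_AGE = 24 * 60 * 60   # cache robots.txt for at most 24 hours
_cache = {}                    # hostname -> (parsed robots.txt, fetch time)

def robots_for(url):
    """Return a cached parse of robots.txt for the url's host, refetching when stale."""
    host = urlparse(url).netloc
    entry = _cache.get(host)
    if entry and time.time() - entry[1] < MAX_CACHE_AGE:
        return entry[0]
    rp = RobotFileParser("http://%s/robots.txt" % host)
    rp.read()
    _cache[host] = (rp, time.time())
    return rp

def allowed(url, user_agent="MyCrawler"):   # "MyCrawler" is a placeholder agent name
    return robots_for(url).can_fetch(user_agent, url)
```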

If you hold links to wait for a future version of robots.txt, then you're adding an artificial 24-hour latency to your crawl. That is, if you crawl my site today, you have to hold all those links for up to 24 hours before you download my robots.txt file again and verify that the links you gathered are allowed. And you could be wrong as often as you're right. Let's say the following happens:

2011-03-08 06:00:00 - You download my robots.txt
2011-03-08 08:00:00 - You crawl the /humans-only/ directory on my site
2011-03-08 22:00:00 - I change my robots.txt to restrict crawlers from accessing /humans-only/
2011-03-09 06:30:00 - You download my robots.txt and throw out the /humans-only/ links.

At the time you crawled, you were allowed to access that directory, so there was no problem with you publishing the links.

You could use the last modified date returned by the Web server when you download robots.txt to determine if you were allowed to read those files at the time, but a lot of servers lie when returning the last modified date. Some large percentage (I don't remember what it is) always return the current date/time as the last modified date because all of their content, including robots.txt, is generated at access time.
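If you do want to record that header anyway, it only takes keeping whatever value the server returns alongside the cached file. A small sketch, assuming Python's standard library, and bearing in mind the caveat above that the value may simply be the server's current time:

```python
import urllib.request
from urllib.robotparser import RobotFileParser

def fetch_robots_with_timestamp(robots_url):
    """Fetch robots.txt and keep whatever Last-Modified the server claims."""
    with urllib.request.urlopen(robots_url) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        last_modified = resp.headers.get("Last-Modified")  # may be absent or untrustworthy
    rp = RobotFileParser()
    rp.parse(body.splitlines())
    return rp, last_modified
```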

Also, adding that restriction to your bot means that you'll have to visit their robots.txt file again even if you don't intend to crawl their site. Otherwise, links will languish in your cache. Your proposed technique raises a lot of issues that you can't handle gracefully. Your best bet is to operate with the information you have at hand.

Most site operators understand about robots.txt caching, and will look the other way if your bot hits a restricted directory on their site within 24 hours of a robots.txt change, provided, of course, that you didn't read the updated robots.txt and then go ahead and crawl the restricted pages anyway. Of those few who question the behavior, a simple explanation of what happened is usually sufficient.

As long as you're open about what your crawler is doing, and you provide a way for site operators to contact you, most misunderstandings are easily corrected. There are a few--a very few--people who will accuse you of all kinds of nefarious activities. Your best bet with them is to apologize for causing a problem and then block your bot from ever visiting their sites.

Upvotes: 5
