Reputation: 261
I'm creating a simple crawler that will scrape from a list of pre-defined sites. My simple question: are there any HTTP headers that the crawler should specifically use? What's considered required, and what's desirable to have defined?
Upvotes: 3
Views: 3836
Reputation: 331
It's good to state who you are, what your intentions are, and how to get hold of you. From running a site and looking at the Apache access.log, I remember that the following pieces of information actually served a purpose (some of them are also set in the StormCrawler code):
If you check out Request fields, you'll find two of interest: User-Agent and From. The second one is the contact email address, though last I checked it doesn't appear in the access.log for Apache2. The User-Agent for an automated agent should contain a name, a version, and a URL to a page with more information about the agent. It is also common to include the word "bot" in the agent name.
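As a minimal sketch of what that might look like (not part of the original answer), here is how such headers could be set with the Python requests library; the bot name, info URL, and email address are placeholder values:

```python
import requests

# Placeholder identity values for an illustrative crawler
HEADERS = {
    # Name, version, and a URL describing the agent, with "bot" in the name
    "User-Agent": "examplebot/1.0 (+https://example.com/bot-info)",
    # Contact email address for the crawler's operator
    "From": "crawler-admin@example.com",
}

response = requests.get("https://example.com/some-page", headers=HEADERS, timeout=10)
print(response.status_code)
```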
Upvotes: 2
Reputation: 4864
You should at least specify a custom user agent (as done here by StormCrawler) so that the webmasters of the sites you are crawling can see that you are a robot and contact you if needed.
More importantly, your crawler should follow the robots.txt directives, throttle the frequency of its requests to each site, and so on, which leads me to the following question: why not reuse and customise an existing open source crawler like StormCrawler, Nutch or Scrapy instead of reinventing the wheel? A rough sketch of those two points is shown below.
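The following is an illustrative sketch (not StormCrawler, Nutch, or Scrapy code) of honouring robots.txt and throttling requests using Python's standard library; the URLs, user agent string, and delay value are assumptions for the example:

```python
import time
import urllib.robotparser
import requests

USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"
DELAY_SECONDS = 5  # minimum pause between requests to the same site

# Fetch and parse the site's robots.txt before crawling it
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

urls = ["https://example.com/a", "https://example.com/b"]
for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue  # skip pages the site disallows for this agent
    requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(DELAY_SECONDS)  # throttle request frequency
```

Existing crawlers handle these concerns (plus retries, politeness per host, and link scheduling) out of the box, which is the main argument for customising one rather than starting from scratch.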
Upvotes: 2