Reputation: 261
I'm creating a simple crawler that will scrape from a list of pre-defined sites. My simple question: are there any HTTP headers that the crawler should specifically use? What's considered required, and what's desirable to have defined?
Upvotes: 3
Views: 3836
Reputation: 331
It's good to state who you are, what your intentions are, and how to get hold of you. From running a site and looking at the Apache access.log, I remember that the following pieces of information actually served a purpose (some of them are also set in the StormCrawler code):
If you check out Request fields, you'll find two of interest: User-Agent and From. The second one is the contact email address, though last I checked it doesn't appear in the access.log for Apache2. The User-Agent for an automated agent should contain a name, a version, and a URL to a page with more information about the agent. It is also common to include the word "bot" in the agent name.
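As a minimal sketch of what that might look like (not part of the original answer), here is how such headers could be set with the Python requests library; the bot name, info URL, and email address are placeholder values:

```python
import requests

# Placeholder identity values for an illustrative crawler
HEADERS = {
    # Name, version, and a URL describing the agent, with "bot" in the name
    "User-Agent": "examplebot/1.0 (+https://example.com/bot-info)",
    # Contact email address for the crawler's operator
    "From": "crawler-admin@example.com",
}

response = requests.get("https://example.com/some-page", headers=HEADERS, timeout=10)
print(response.status_code)
```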
Upvotes: 2
Reputation: 4864
You should at least specify a custom user agent (as done here by StormCrawler) so that the webmasters of the sites you are crawling can see that you are a robot and contact you if needed.
More importantly, your crawler should follow the robots.txt directives, throttle the frequency of its requests to each site, and so on, which leads me to the following question: why not reuse and customise an existing open source crawler like StormCrawler, Nutch or Scrapy instead of reinventing the wheel? A rough sketch of those two points is shown below.
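The following is an illustrative sketch (not StormCrawler, Nutch, or Scrapy code) of honouring robots.txt and throttling requests using Python's standard library; the URLs, user agent string, and delay value are assumptions for the example:

```python
import time
import urllib.robotparser
import requests

USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"
DELAY_SECONDS = 5  # minimum pause between requests to the same site

# Fetch and parse the site's robots.txt before crawling it
robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

urls = ["https://example.com/a", "https://example.com/b"]
for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        continue  # skip pages the site disallows for this agent
    requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(DELAY_SECONDS)  # throttle request frequency
```

Existing crawlers handle these concerns (plus retries, politeness per host, and link scheduling) out of the box, which is the main argument for customising one rather than starting from scratch.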
Upvotes: 2