Reputation: 61
I recently built a simple web crawler and I would like to use it on the web a little. My question is: what ethical rules should I follow, and how do I follow them? I have heard about the robots.txt file; how do I open it in Python and what should I do with it? And are there other ethical rules I should follow, like a maximum number of requests per second? Thanks in advance.
Upvotes: 1
Views: 969
Reputation: 2828
robots.txt is a plain text file in which site owners list the pages they do not want crawled or indexed by web spiders. Mostly it covers information that is not very interesting anyway, and you can still scrape those pages by making your web spider pretend to be a regular user.
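If you do want to honour robots.txt from Python, the standard library ships a parser for it. A minimal sketch, assuming a hypothetical crawler name ("MyCrawler/1.0") and example.com URLs:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# Check whether your bot is allowed to crawl a given page
if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("disallowed by robots.txt")

# Some sites also declare a Crawl-delay; honour it if present
print("requested crawl delay:", rp.crawl_delay("MyCrawler/1.0"))
```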
Every request you send to a page carries User-Agent metadata that tells the server WHO YOU ARE: a user with Firefox, or a web spider like the Feedly fetcher (Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)). You can also pretend to be an IE 6.0 user.
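As a sketch of how a User-Agent is set in practice, here is an example using the third-party requests library; the bot name and contact URL are placeholders, not real values:

```python
import requests

headers = {
    # Identify your crawler and give site owners a way to reach you
    "User-Agent": "MyCrawler/1.0 (+https://example.com/bot-info)"
}
response = requests.get("https://example.com/some/page.html", headers=headers)
print(response.status_code)
```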
A breach of ethics and morality is not a violation of criminal law. On most content sites there is a "privacy" (or terms) link in the page footer, which in most cases asks you to credit the source of the material.
Once I scraped a news site at 15 pages per second and was banned for 10 minutes as a DDoS attack; when I set the interval between requests to 200 ms, everything worked. But it depends on the server configuration.
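A minimal way to throttle a crawler is a fixed sleep between fetches. The sketch below uses the 200 ms interval mentioned above as a starting point; the URLs and bot name are placeholders, and the right delay varies per site:

```python
import time
import requests

urls = [
    "https://example.com/page1.html",
    "https://example.com/page2.html",
]

for url in urls:
    response = requests.get(url, headers={"User-Agent": "MyCrawler/1.0"})
    print(url, response.status_code)
    time.sleep(0.2)  # wait 200 ms before the next request
```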
Upvotes: 3