Reputation: 61
I recently built a simple web crawler and I would like to use it on the web a little. My question is: what ethical rules should I follow, and how do I follow them? I have heard about the robots.txt file; how do I open it in Python and what should I do with it? And are there other ethical rules I should follow, like a maximum number of requests per second? Thanks in advance.
Upvotes: 1
Views: 969
Reputation: 2828
robots.txt is a plain text file in which site owners list the pages they do not want crawled or indexed by web spiders. Mostly it covers information that is not very interesting anyway, and you can still scrape those pages by making your web spider pretend to be a regular user.
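If you do want to honour robots.txt from Python, the standard library ships a parser for it. A minimal sketch, assuming a hypothetical crawler name ("MyCrawler/1.0") and example.com URLs:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# Check whether your bot is allowed to crawl a given page
if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("disallowed by robots.txt")

# Some sites also declare a Crawl-delay; honour it if present
print("requested crawl delay:", rp.crawl_delay("MyCrawler/1.0"))
```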
Every request you send to a page carries User-Agent metadata that tells the server WHO YOU ARE: a user with Firefox, or a web spider like the Feedly fetcher (Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)). You can also pretend to be an IE 6.0 user.
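As a sketch of how a User-Agent is set in practice, here is an example using the third-party requests library; the bot name and contact URL are placeholders, not real values:

```python
import requests

headers = {
    # Identify your crawler and give site owners a way to reach you
    "User-Agent": "MyCrawler/1.0 (+https://example.com/bot-info)"
}
response = requests.get("https://example.com/some/page.html", headers=headers)
print(response.status_code)
```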
A breach of ethics and morality is not a violation of criminal law. On most content sites there is a "privacy" (or terms) link in the page footer, which in most cases asks you to credit the source of the material.
Once I scraped a news site at 15 pages per second and was banned for 10 minutes as a DDoS attack; when I set the interval between requests to 200 ms, everything worked. But it depends on the server configuration.
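A minimal way to throttle a crawler is a fixed sleep between fetches. The sketch below uses the 200 ms interval mentioned above as a starting point; the URLs and bot name are placeholders, and the right delay varies per site:

```python
import time
import requests

urls = [
    "https://example.com/page1.html",
    "https://example.com/page2.html",
]

for url in urls:
    response = requests.get(url, headers={"User-Agent": "MyCrawler/1.0"})
    print(url, response.status_code)
    time.sleep(0.2)  # wait 200 ms before the next request
```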
Upvotes: 3