Reputation: 11439
I was working on a simple application to pull some currency conversions from a website when I received the error message below, stating that they have a no-automated-extraction policy.
Autoextraction Prohibited
Automated extraction of our content is prohibited. See http://www.xe.com/errors/noautoextract.htm.
I don't really have an intention of breaking their policy but I am curious as to how they can tell. Can anyone enlighten me?
Upvotes: 0
Views: 174
Reputation: 49597
1) Checking the User-Agent header
2) Introducing a JavaScript pop-up, something like "Click OK to enter".
3) Counting the number of requests per hour coming from a particular IP address, if you are not behind NAT (checks 1 and 3 are sketched in code after this list).
For more detail, take a look at the PyCon talk web-strategies-for-programming-websites-that-don-t-expected-it by Asheesh Laroia.
Also take a look at A Standard for Robot Exclusion.
Some websites also use:
4) CAPTCHAs and reCAPTCHAs
5) Redirection, which means you need to send an HTTP Referer header to get your data.
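For a rough idea of what checks 1 and 3 can look like on the server side, here is a minimal sketch using Python's standard http.server. The blocked User-Agent keywords and the hourly limit are made-up values for illustration, not anything xe.com actually uses.

```python
# Minimal sketch of server-side bot checks: User-Agent filtering and
# per-IP request counting. The keyword list and threshold are
# illustrative assumptions only.
import time
from collections import defaultdict, deque
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_AGENT_KEYWORDS = ("python-urllib", "curl", "wget", "scrapy")  # assumed list
MAX_REQUESTS_PER_HOUR = 100                                           # assumed limit

requests_by_ip = defaultdict(deque)  # ip -> timestamps of recent requests


class DetectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = (self.headers.get("User-Agent") or "").lower()
        ip = self.client_address[0]

        # Check 1: obvious non-browser User-Agent strings.
        if any(keyword in agent for keyword in BLOCKED_AGENT_KEYWORDS):
            return self.refuse("Autoextraction Prohibited")

        # Check 3: too many requests from one IP within the last hour.
        now = time.time()
        window = requests_by_ip[ip]
        window.append(now)
        while window and now - window[0] > 3600:
            window.popleft()
        if len(window) > MAX_REQUESTS_PER_HOUR:
            return self.refuse("Too many requests")

        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>1 USD = ... EUR</body></html>")

    def refuse(self, reason):
        self.send_response(403)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(reason.encode())


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), DetectingHandler).serve_forever()
```

A real site would usually do this in the web server or a reverse proxy rather than in application code, but the idea is the same: the request headers and the request rate per IP are the two cheapest signals.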
Upvotes: 2
Reputation: 6959
Basically, if you request a URL and get the HTML page back, there's pretty much nothing the site can do about it; after all, that's exactly what a web server is for.
But there are several techniques to tell a bot apart from a human being requesting the page. Some of them are hints for bots that "behave" (robots.txt, for example), while others try to detect a bot and stop it.
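To make that concrete, here is a minimal client-side sketch (the URL is only a placeholder): from the server's point of view the request is just headers plus a path, and Python's urllib announces itself with a User-Agent of the form "Python-urllib/3.x" unless you set one yourself, which is often the first thing a site checks.

```python
# Minimal sketch: the same GET request with the default urllib User-Agent
# and with a browser-like one. The URL is a placeholder, not a real endpoint.
import urllib.request

URL = "https://www.example.com/rates"

# Default: urllib adds "User-Agent: Python-urllib/3.x" when the request is
# sent, which immediately marks the client as a script.
plain = urllib.request.Request(URL)
print(plain.get_header("User-agent"))  # None; the default is added at send time

# Ordinary browser traffic carries a header more like this one.
browser_like = urllib.request.Request(
    URL,
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Firefox/115.0"},
)
print(browser_like.get_header("User-agent"))
```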
Upvotes: 0
Reputation: 2206
I think they watch at least two parameters:
Upvotes: 1
Reputation: 80192
It is done at the HTTP server level by implementing the Robots Exclusion Protocol.
From the Robots exclusion standard:
The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.
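The convention is advisory: a cooperating crawler is expected to read the site's robots.txt and skip whatever it disallows. Here is a minimal sketch using Python's standard library (the user agent name and the page path are made-up examples):

```python
# Minimal sketch: honouring robots.txt with the standard library.
# "MyCurrencyBot" and the page path are hypothetical examples.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.xe.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

page = "http://www.xe.com/some-page"
if rp.can_fetch("MyCurrencyBot", page):
    print("robots.txt allows fetching", page)
else:
    print("robots.txt disallows fetching", page)
```

This only helps against clients that choose to cooperate, which matches the "cooperating web crawlers" wording in the quote above.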
Upvotes: 1