ahodder

Reputation: 11439

How can a website detect automated extraction?

I was working on a simple application to pull some currency conversions from a website when I received the error message below, stating that they have a no-automated-extraction policy.

Autoextraction Prohibited
Automated extraction of our content is prohibited. See http://www.xe.com/errors/noautoextract.htm.

I don't really have an intention of breaking their policy but I am curious as to how they can tell. Can anyone enlighten me?

Upvotes: 0

Views: 174

Answers (4)

RanRag

Reputation: 49597

1) User-Agent

2) Introducing a JavaScript pop-up, something like "Click OK to enter".

3) Counting the number of requests per hour from a particular IP address (less reliable if many users share an IP behind NAT); a rough server-side sketch of checks 1) and 3) follows after this list.

For more detail, take a look at this PyCon talk by Asheesh Laroia: web-strategies-for-programming-websites-that-don-t-expected-it.

Also take a look at A Standard for Robot Exclusion.

Some websites also use

4) CAPTCHAs and reCAPTCHAs

5) Redirection, which means you need to send an HTTP Referer header to get your data.
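
Here is a minimal, framework-free sketch of how a server might combine checks 1) and 3). The thresholds, the SUSPICIOUS_AGENTS list and the looks_automated helper are made up for illustration, not taken from any real site:

    import time
    from collections import defaultdict, deque

    # Illustrative values only; a real site would tune these to its own traffic.
    MAX_REQUESTS_PER_HOUR = 300
    SUSPICIOUS_AGENTS = ("python", "java", "curl", "wget", "libwww")

    _recent_hits = defaultdict(deque)  # ip -> timestamps of that IP's recent requests

    def looks_automated(ip, user_agent):
        """Return True if this request smells like a bot (checks 1 and 3 above)."""
        # Check 1: an empty or library-style User-Agent is an immediate red flag.
        ua = (user_agent or "").lower()
        if not ua or any(token in ua for token in SUSPICIOUS_AGENTS):
            return True

        # Check 3: count how many requests this IP has made in the last hour.
        now = time.time()
        hits = _recent_hits[ip]
        hits.append(now)
        while hits and now - hits[0] > 3600:
            hits.popleft()
        return len(hits) > MAX_REQUESTS_PER_HOUR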

Upvotes: 2

Alexander Rühl

Reputation: 6959

Basically, if you request a URL and you get the HTML page back, there's pretty much nothing the site can do about it; after all, serving pages is exactly what a web server is for.

But there are several techniques to stop bots, as opposed to human beings requesting the page. Some of them are hints for bots which "behave"; others try to detect a bot and stop it. A minimal example of such a plain request is sketched below.
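
To illustrate the first point, at the HTTP level a scripted fetch is just another request for the page; only clues such as headers, request rate, or the absence of JavaScript execution set it apart. This is only a sketch; example.com and the Mozilla-style User-Agent string are placeholders:

    from urllib.request import Request, urlopen

    # From the server's point of view this is simply "someone asked for the page".
    # example.com is a placeholder URL.
    req = Request("http://example.com/", headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req).read().decode("utf-8", errors="replace")
    print(html[:200])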

Upvotes: 0

berty

Reputation: 2206

I think they watch at least two parameters:

  • the number of queries coming from the same IP address within a given time interval
  • the User-Agent header in your HTTP queries: if it is empty, or if it doesn't look like a web browser's User-Agent header (especially if it indicates "Java" or something like that ;) ), they can assume it's not "fair use"; see the example below of what a naive client sends by default.
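
For instance, a naive Python client advertises a library-style User-Agent unless you change it (Java's HttpURLConnection similarly sends something like "Java/1.x" by default). The browser-style string at the end is just an example:

    import urllib.request

    # urllib identifies itself as "Python-urllib/x.y" unless you override it,
    # which is exactly the kind of User-Agent a site can flag as "not a browser".
    opener = urllib.request.build_opener()
    print(opener.addheaders)   # e.g. [('User-agent', 'Python-urllib/3.10')]

    # Overriding it is trivial, which is why the User-Agent check alone is a
    # weak signal and is usually combined with rate limiting.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")]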

Upvotes: 1

Aravind Yarram

Reputation: 80192

It is done at the HTTP server level by implementing the Robots Exclusion Protocol.

From the Robots exclusion standard article:

The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web crawlers and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.
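
Note that robots.txt is purely advisory: it tells cooperating crawlers what not to fetch, but it does not detect or block anything by itself. Python's standard library can read it; the bot name and the path below are only examples, not taken from xe.com's actual rules:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt, then ask whether a given
    # user agent may crawl a given URL.
    rp = RobotFileParser()
    rp.set_url("http://www.xe.com/robots.txt")
    rp.read()
    print(rp.can_fetch("MyCurrencyBot/1.0", "http://www.xe.com/some/page.htm"))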

Upvotes: 1
