Saurabh Agarwal

Reputation: 527

How can a webpage be made so that it cannot be scraped by bots?

This question grew out of an answer here.

My question, therefore, is: what steps can one take to fend off standard scrapers?

Upvotes: 0

Views: 1707

Answers (5)

David

Reputation: 20105

In addition to all the previous mentions of robots.txt, the robots meta tag, and using more JavaScript, one of the surest methods that I know of is to put restricted content behind a user login. This will stop all but purpose-built bots. Add a strong CAPTCHA (like reCAPTCHA) to the user login and purpose-built bots will be blocked too.

If a site is looking to verify the identity of a client (i.e. including whether it's a bot), that's what user logins are for. :)

User logins can also be disabled if strange activity is detected.
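A minimal sketch of the gating idea, with a hypothetical in-memory session store standing in for a real login system (in practice you'd use signed cookies or a server-side session database, with a CAPTCHA on the login page):

```python
# Hypothetical session store mapping tokens to authenticated users.
ACTIVE_SESSIONS = {"token-abc123": "alice"}

def serve_page(session_token):
    """Return (status_code, body) for a request carrying session_token."""
    user = ACTIVE_SESSIONS.get(session_token)
    if user is None:
        # Unauthenticated clients, bots included, are redirected to the
        # login page, where a CAPTCHA would block purpose-built scrapers.
        return 302, "/login"
    return 200, "restricted content for " + user
```

Disabling a suspicious account then amounts to deleting its tokens from the session store.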

Upvotes: 1

Mike

Reputation: 2153

  • use CAPTCHA
  • analyze traffic (from where and how often your pages are requested)
  • display text mixed with pictures
  • use more client data processing (JavaScript, Java, Flash)
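The traffic-analysis point can be sketched as a simple sliding-window rate limiter; the window size and request budget below are hypothetical values you would tune for your own site:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # hypothetical window length
MAX_REQUESTS = 30     # hypothetical per-window budget

# Sliding window of request timestamps per client IP.
_requests = defaultdict(deque)

def allow_request(ip, now=None):
    """Return False once an IP exceeds the per-window request budget."""
    now = time.time() if now is None else now
    window = _requests[ip]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True
```

Clients that trip the limit can then be served a CAPTCHA instead of being blocked outright, which keeps false positives recoverable.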

Upvotes: 1

Ray

Reputation: 21905

If you can do server-side processing of requests, you can analyze the user-agent string and return a 403 if you detect a scraper. This would not be foolproof: an unscrupulous scraper could use a standard browser user agent to fool your code, and false positives would deny your site to real users. You may also end up denying search engines access to your pages.

But if you can identify 'standard scrapers', this is another tool for controlling access by scrapers that do not respect the robots tag.
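A rough sketch of such a user-agent check; the denylist of scraper signatures here is a hypothetical, non-exhaustive sample:

```python
# Hypothetical denylist of user-agent substrings. A scraper can spoof a
# browser UA, so this only catches "standard" scrapers that identify
# themselves honestly.
SCRAPER_SIGNATURES = ("curl", "wget", "python-requests", "scrapy")

def check_user_agent(user_agent):
    """Return an HTTP status: 403 for a recognized scraper, else 200."""
    ua = user_agent.lower()
    if any(sig in ua for sig in SCRAPER_SIGNATURES):
        return 403
    return 200
```

To avoid locking out search engines, you would whitelist known crawler user agents (and ideally verify them by reverse DNS) before applying a check like this.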

Upvotes: 0

Widor

Reputation: 13275

The key word in your question is "standard" scrapers.

There's no way to prevent all possible bots from scraping your site as they could just pose as a regular visitor.

For the 'good' bots, use robots.txt and/or a META tag specifying whether a bot may index content and/or follow links:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

For the 'bad' ones, you'll have to catch them once and block them on a combination of IP, request/referrer headers, etc.
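For the robots.txt route, a minimal file that asks compliant crawlers to stay out of the whole site looks like:

```
User-agent: *
Disallow: /
```

Like the META tag, this is purely advisory: well-behaved crawlers honor it, while 'bad' bots ignore it, which is why the IP/header blocking above is still needed.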

Upvotes: 1

Gooey

Reputation: 4778

Simply place a meta tag like

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

This tells a bot that it may not index your page or follow its links.

Upvotes: 0
