Igor Savinkin

Reputation: 6267

How to find out my site is being scraped?

I have some points so far:

  1. Network bandwidth consumption that causes throughput problems (still matches if a proxy is used).
  2. When querying a search engine for keywords, new references appear to other, similar resources with the same content (still matches if a proxy is used).
  3. Multiple requests from the same IP.
  4. A high request rate from a single IP. (By the way: what is a normal rate?)
  5. A headless or unusual user agent (still matches if a proxy is used).
  6. Requests at predictable (equal) intervals from the same IP.
  7. Certain supporting files are never requested, e.g. favicon.ico and various CSS and JavaScript files (still matches if a proxy is used).
  8. The client's request sequence, e.g. the client accesses pages that are not directly reachable (still matches if a proxy is used).
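Points 4 and 6 could be measured from an access log roughly as follows. This is only a sketch: the function name `interval_stats` and the `(ip, unix_timestamp)` input format are assumptions for illustration, and there is no universal "normal" rate, so the numbers have to be compared against your own typical traffic.

```python
from collections import defaultdict
from statistics import mean, pstdev


def interval_stats(events):
    """Return {ip: (requests_per_minute, coefficient_of_variation)}.

    `events` is an iterable of (ip, unix_timestamp) pairs, an assumed
    format that you would extract from your own access log.
    """
    by_ip = defaultdict(list)
    for ip, ts in events:
        by_ip[ip].append(ts)

    stats = {}
    for ip, times in by_ip.items():
        if len(times) < 3:
            continue  # too few requests to say anything about spacing
        times.sort()
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        avg_gap = mean(gaps)
        # Coefficient of variation: near 0 means almost equal intervals.
        cv = pstdev(gaps) / avg_gap if avg_gap > 0 else 0.0
        span = times[-1] - times[0]
        rate = 60 * (len(times) - 1) / span if span > 0 else float("inf")
        stats[ip] = (rate, cv)
    return stats
```

A very low coefficient of variation combined with a sustained rate is the pattern described in point 6. A scraper that adds random delays or rotates proxies will not show up this way.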

Would you add more to this list?

Which of these points would still match if a scraper uses proxies?

Upvotes: 6

Views: 4032

Answers (2)

Sh4d0wsPlyr

Reputation: 968

As a first note, consider whether it is worthwhile to provide an API for bots in the future. If you are being crawled by another company, and it is information you want to provide to them anyway, that makes your website valuable to them. Creating an API would reduce your server load substantially and give you full clarity on who is crawling you.

Second, from personal experience (I wrote web crawlers for quite a while), you can generally tell immediately by tracking which browser accessed your website, i.e. the User-Agent string. If they are using one of the automated tools, or the default client of a programming language, it will be distinctly different from your average user. You can also track the log file and update your .htaccess to ban them (if that's what you are looking to do).

Other than that, it's usually fairly easy to spot: repeated, very consistent opening of pages.
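As a rough illustration of the user-agent check, here is a minimal sketch. The marker list is an assumption based on the default User-Agent strings that some common tools send; it is illustrative, not exhaustive, and `looks_automated` is a hypothetical helper name. A scraper can trivially spoof a browser User-Agent, so this only catches the careless ones.

```python
# Substrings found in the default User-Agent of some common HTTP clients
# and headless browsers. Illustrative only; extend from your own logs.
AUTOMATION_MARKERS = (
    "python-requests",
    "curl/",
    "wget/",
    "scrapy",
    "headlesschrome",
    "phantomjs",
    "go-http-client",
    "java/",
    "libwww-perl",
)


def looks_automated(user_agent):
    """Return True if the User-Agent is empty or matches a known marker."""
    if not user_agent or not user_agent.strip():
        return True  # real browsers always send a User-Agent
    ua = user_agent.lower()
    return any(marker in ua for marker in AUTOMATION_MARKERS)
```

For example, `looks_automated("python-requests/2.31.0")` is `True`, while a regular desktop Firefox string is not flagged.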

Check out this other post for more information on how you might deal with them, and for some thoughts on how to identify them:

How to block bad unidentified bots crawling my website?

Upvotes: 2

sarin

Reputation: 5307

I would also add analysis of when requests from the same client are made. For example, if the same IP address requests the same data at the same time every day, the process is probably on an automated schedule, and hence is likely to be scraping.
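A minimal sketch of that schedule check, assuming log entries have already been parsed into `(ip, path, datetime)` tuples. The function name and the `min_days` and `slot_minutes` defaults are hypothetical choices, not recommended values.

```python
from collections import defaultdict
from datetime import datetime
from typing import Iterable


def daily_schedule_suspects(
    events: Iterable[tuple[str, str, datetime]],
    min_days: int = 3,
    slot_minutes: int = 5,
) -> list[tuple[str, str, str]]:
    """Find (ip, path, "HH:MM") combinations seen on at least `min_days` days.

    Each request is bucketed into a time-of-day slot of `slot_minutes`
    minutes; the returned "HH:MM" is the start of that slot.
    """
    days_seen = defaultdict(set)
    for ip, path, when in events:
        slot = (when.hour * 60 + when.minute) // slot_minutes
        days_seen[(ip, path, slot)].add(when.date())

    suspects = []
    for (ip, path, slot), days in days_seen.items():
        if len(days) >= min_days:
            hour, minute = divmod(slot * slot_minutes, 60)
            suspects.append((ip, path, f"{hour:02d}:{minute:02d}"))
    return sorted(suspects)
```

Fixed slots are a simplification: two requests a minute apart can fall on either side of a slot boundary, so a real check might also look at neighbouring slots.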

Possibly also add analysis of how many pages each user session has touched. For example, if a particular user on a particular day has browsed every page on your site, and you deem this unusual, then perhaps it's another indicator.

It feels like you need a range of indicators, and need to score them and combine the scores to show who is most likely scraping.
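The coverage idea could be sketched like this, assuming you have a list of your site's pages and log entries parsed into `(client, day, path)` tuples. Both the input format and the name `coverage_by_client` are assumptions for illustration.

```python
from collections import defaultdict


def coverage_by_client(events, site_pages):
    """Return {(client, day): fraction of site pages requested that day}.

    `events` is an iterable of (client, day, path) tuples; `site_pages`
    is the collection of page paths that make up the site.
    """
    pages = set(site_pages)
    if not pages:
        raise ValueError("site_pages must not be empty")

    visited = defaultdict(set)
    for client, day, path in events:
        if path in pages:  # ignore assets and unknown URLs
            visited[(client, day)].add(path)

    return {key: len(paths) / len(pages) for key, paths in visited.items()}
```

What fraction counts as unusual depends on the site; a client at or near 1.0 on a site with many pages is worth a closer look.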
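One simple way to combine indicators is a weighted sum, sketched below. The indicator names and weights are made up for illustration; in practice you would tune them against traffic you have already classified by hand.

```python
# Hypothetical weights: higher means the indicator is trusted more.
WEIGHTS = {
    "automated_user_agent": 3.0,
    "regular_intervals": 2.0,
    "no_static_assets": 2.0,
    "daily_schedule": 1.5,
    "high_site_coverage": 1.5,
}


def scraper_score(indicators, weights=WEIGHTS):
    """Return a score in [0, 1]: the weighted share of indicators that fired.

    `indicators` maps indicator names to booleans; missing names count
    as not fired.
    """
    total = sum(weights.values())
    fired = sum(w for name, w in weights.items() if indicators.get(name))
    return fired / total


def rank_clients(observations, weights=WEIGHTS):
    """Return [(client, score)] sorted from most to least suspicious."""
    scored = [
        (client, round(scraper_score(ind, weights), 2))
        for client, ind in observations.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

No single indicator is conclusive on its own, which is why the score only ranks clients for review rather than deciding who to block.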

Upvotes: 1
