user15063

Reputation:

How would you protect a database of links from being scraped?

I have a large database of links, all sorted in specific ways and attached to other information that is valuable (to some people).

Currently, my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If that's greater than x, it redirects you to a captcha page.
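Roughly, the flow in link.php looks like this (table/column names and the threshold are simplified placeholders):

    <?php
    // link.php?id=123 -- simplified sketch of the current flow
    $pdo   = new PDO('mysql:host=localhost;dbname=links', 'user', 'pass');
    $ip    = $_SERVER['REMOTE_ADDR'];
    $limit = 100; // the "x" mentioned above

    // log this request
    $pdo->prepare('INSERT INTO request_log (ip, requested_at) VALUES (?, NOW())')
        ->execute(array($ip));

    // count requests from this IP in the last 5 minutes
    $stmt = $pdo->prepare(
        'SELECT COUNT(*) FROM request_log
         WHERE ip = ? AND requested_at > NOW() - INTERVAL 5 MINUTE');
    $stmt->execute(array($ip));

    if ($stmt->fetchColumn() > $limit) {
        header('Location: /captcha.php');
        exit;
    }

    // otherwise look up the link and send the user on their way
    $stmt = $pdo->prepare('SELECT url FROM links WHERE id = ?');
    $stmt->execute(array((int) $_GET['id']));
    header('Location: ' . $stmt->fetchColumn());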

That all works fine and dandy, but the site has been getting really popular (and has been getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize how often I have to hit up PHP at all. I wanted to show links in plain text instead of through link.php?id= and use an onclick function to simply add 1 to the view count. I'm still hitting up PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.

The problem is that this makes the site REALLY scrapable. Is there anything I can do to prevent that without relying on PHP to do the check before spitting out the link?

Upvotes: 6

Views: 337

Answers (5)

hoju

Reputation: 29452

Most scrapers just analyze static HTML, so encode your links and then decode them dynamically in the client's web browser with JavaScript.
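For instance, a minimal sketch (class names and attributes made up) could base64-encode the href on the server and only decode it on click, so the plain URL never appears in the static markup:

    <?php
    // sketch only: $url comes from the existing lookup by id
    $enc = base64_encode($url);
    ?>
    <a href="#" class="enc-link" data-enc="<?php echo htmlspecialchars($enc); ?>">view link</a>

    <script>
    // decode in the browser; a static-HTML scraper only ever sees the base64 blob
    document.querySelectorAll('a.enc-link').forEach(function (a) {
        a.addEventListener('click', function (e) {
            e.preventDefault();
            window.location = atob(a.getAttribute('data-enc')); // atob() undoes base64_encode()
        });
    });
    </script>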

Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.

Upvotes: 0

sutch

Reputation: 1295

It seems that the bottleneck is at the database. Each request performs an insert (logs the request), then a select (determine the number of requests from the IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.

Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).
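A rough sketch with the memcached extension (the key name and 100-hit threshold are assumptions) could replace the insert and select entirely:

    <?php
    // per-IP counter kept in memcached for the 5-minute window; no DB work per hit
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $key = 'hits:' . $_SERVER['REMOTE_ADDR'];

    // add() only succeeds if the key does not exist yet; 300 s = 5 minutes
    if ($mc->add($key, 1, 300)) {
        $hits = 1;
    } else {
        $hits = $mc->increment($key); // bump the existing counter instead
    }

    if ($hits > 100) { // the "x" from the question
        header('Location: /captcha.php');
        exit;
    }
    // ...continue with the normal link lookup

Note that this counts hits in a fixed 5-minute window starting at the first request rather than a true sliding window, which is usually close enough for throttling.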

As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).

Upvotes: 2

goat

Reputation: 31813

You could do the IP throttling at the web server level. Maybe a module exists for your web server, or, as an example, with Apache you can write your own RewriteMap and have it consult a daemon program, so you can do more complex things. Have the daemon program query an in-memory store. It will be fast.
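For example, a prg: RewriteMap helper can be a very small script; here is a sketch (the Apache lines, script path, and 100-hit threshold are all assumptions):

    #!/usr/bin/env php
    <?php
    // Assumed Apache config, roughly:
    //   RewriteMap ipcheck "prg:/usr/local/bin/ipcheck.php"
    //   RewriteCond ${ipcheck:%{REMOTE_ADDR}} =block
    //   RewriteRule ^/?link\.php$ /captcha.php [R,L]
    // Apache writes one IP per line to stdin and expects one answer per line on stdout.
    // Since this is a single long-lived process, a plain array works as the memory store.
    $hits = array(); // ip => timestamps of recent requests

    while (($ip = fgets(STDIN)) !== false) {
        $ip  = trim($ip);
        $now = time();

        // keep only timestamps from the last 5 minutes, then record this request
        $recent = array();
        if (isset($hits[$ip])) {
            foreach ($hits[$ip] as $t) {
                if ($t > $now - 300) {
                    $recent[] = $t;
                }
            }
        }
        $recent[]  = $now;
        $hits[$ip] = $recent;

        echo (count($recent) > 100 ? "block" : "ok") . "\n";
        fflush(STDOUT); // prg: maps need an immediate, unbuffered reply per line
    }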

Upvotes: 1

Radu M.

Reputation: 5738

Nothing you do on the client side can be protected, so why not just use AJAX?

Have an onClick event that calls an AJAX function, which returns just the link and fills it into a DIV on your page. Because the request and the answer are small, it will work fast enough for what you need. Just make sure the function you call checks the timestamp; it is easy to write a script that calls that function many times to steal your links.

You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast; the client doesn't even know it's not pure JS.
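The server-side half could be a tiny endpoint (file, table, and helper names are made up) that re-checks the rate limit and echoes nothing but the URL, which the click handler then drops into the DIV:

    <?php
    // get_link.php?id=123 -- sketch of an AJAX endpoint that returns only the URL text
    require 'throttle.php'; // hypothetical helper wrapping the timestamp check

    if (!throttle_ok($_SERVER['REMOTE_ADDR'])) {
        header('HTTP/1.1 403 Forbidden'); // the client can show the captcha instead
        exit;
    }

    $pdo  = new PDO('mysql:host=localhost;dbname=links', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT url FROM links WHERE id = ?');
    $stmt->execute(array((int) $_GET['id']));

    header('Content-Type: text/plain');
    echo $stmt->fetchColumn(); // just the link, nothing else

On the client, something like jQuery's $('#link-target').load('get_link.php?id=123') in the click handler would fill the DIV.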

Upvotes: 0

Byron Whitlock

Reputation: 53861

Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than 1 hour etc.
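The cleanup could be a one-line cron job; a sketch, assuming the log lives in a table called request_log:

    <?php
    // prune_log.php -- run from cron (e.g. 0 3 * * * php /path/to/prune_log.php)
    // table and column names are assumptions
    $pdo = new PDO('mysql:host=localhost;dbname=links', 'user', 'pass');
    $deleted = $pdo->exec(
        'DELETE FROM request_log WHERE requested_at < NOW() - INTERVAL 1 HOUR'
    );
    echo "pruned $deleted old request-log rows\n";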

If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.

Upvotes: 0
