Reputation: 13
Hi everyone!
I am wondering whether there is a simple way to block automated content crawlers on a shared web host (LAMP, no root access).
For example, I have a large collection of JPG images, and someone has written an automated program (in PHP or another language) to download all of my image data.
I was thinking of using JavaScript to decrypt the images on the client side, which would make it harder, or at least more effort, for a crawler to collect all the data. But I am not sure how this would affect browsers without JavaScript support, or how effective it would be at stopping such a crawler.
Of course, well-behaved search engine crawlers should still be allowed.
Apart from images, what about text, audio, or video content? How should I deal with those?
Upvotes: 1
Views: 692
Reputation: 27609
Unless your content is hidden behind some form of authentication, anyone who seriously tries will be able to get it. That said, you can make it a little more difficult using your .htaccess file.
To prevent hotlinking (referencing your files from another site), you can add the following, which blocks access to anything that ends with gif, jpg, js, or css when the request carries a non-empty HTTP_REFERER that is not your site:
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com/.*$ [NC]
RewriteRule \.(gif|jpg|js|css)$ - [F]
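To make the decision logic of these rules easier to follow, here is a minimal Python sketch that mirrors them. It is only an illustration, not how Apache evaluates rewrites internally. The function name `is_forbidden` and the constants are hypothetical, and the dot in `mydomain\.com` is escaped so that it matches a literal dot.

```python
import re

# Hypothetical names; this only mirrors the intent of the .htaccess rules above.
PROTECTED = re.compile(r"\.(gif|jpg|js|css)$")  # RewriteRule pattern (case-sensitive)
OWN_SITE = re.compile(r"^http://(www\.)?mydomain\.com/.*$", re.IGNORECASE)  # [NC]


def is_forbidden(path: str, referer: str) -> bool:
    """Return True when a request would receive 403 Forbidden under the rules."""
    if not PROTECTED.search(path):
        return False  # the RewriteRule pattern does not match this file type
    if referer == "":
        return False  # first RewriteCond: an empty Referer is let through
    # second RewriteCond: forbid only when the Referer is not our own site
    return OWN_SITE.match(referer) is None
```

Note that a request with no Referer header at all passes the check, so this deters hotlinking from other pages rather than a crawler that simply omits the header.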
You can also block access by user agent (full list here):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
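As a rough illustration of what the two conditions match, the following Python sketch applies the same anchored patterns to a User-Agent string. The name `is_blocked_agent` is hypothetical, and the sample agent strings in the usage note are made up for the example.

```python
import re

# Patterns copied from the rules above; [OR] means either condition is enough.
BLOCKED_AGENTS = [re.compile(r"^BlackWidow"), re.compile(r"^Zeus")]


def is_blocked_agent(user_agent: str) -> bool:
    """Return True when the User-Agent starts with one of the blocked names."""
    return any(pattern.search(user_agent) for pattern in BLOCKED_AGENTS)
```

Because each pattern is anchored with `^` and there is no `[NC]` flag, a string such as `Mozilla/5.0 (compatible; BlackWidow)` or a lowercase `zeus` would not be blocked, and a crawler can send any User-Agent it likes.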
You can also block by IP address once you have identified specific "bad" bots:
order allow,deny
deny from 123.45.67.89
allow from all
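These directives use the Apache 2.2 access-control syntax. With `order allow,deny`, the allow rules are checked first, then the deny rules, and a matching deny wins. A small Python sketch of that evaluation follows, with hypothetical names `DENY`, `ALLOW_ALL`, and `is_allowed`; it is a simplified model of the three lines above, not Apache's implementation.

```python
import ipaddress

# Simplified model of the three directives above (names are hypothetical).
DENY = [ipaddress.ip_network("123.45.67.89/32")]  # "deny from 123.45.67.89"
ALLOW_ALL = True                                   # "allow from all"


def is_allowed(client_ip: str) -> bool:
    """Model 'order allow,deny': allow must match, and any deny match rejects."""
    addr = ipaddress.ip_address(client_ip)
    if not ALLOW_ALL:
        return False  # nothing allowed the request, so it is rejected by default
    # Deny rules are evaluated last and override the allow.
    return not any(addr in network for network in DENY)
```

This only helps against a bot that keeps using the same address; one that rotates IPs will not be caught by a fixed list.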
Upvotes: 1
Reputation: 643
This is one of the most frequently asked questions: 'How do I protect my content from being stolen?'
The simple answer is that you can't, not even against humans. You can, however, make the content harder to get at with some tricks that I will not go into.
The reason it is impossible to fully stop someone from taking your content is that when a person visits your website, they physically download the output of that page, that is, whatever the server sends to the client.
At that point, the client has FULL access to everything the browser is displaying or has used, and you cannot stop this. If you don't want your images taken, don't put them online.
NOTE: You can put a watermark over your images so that your logo stays on them if they are stolen, but in most cases that is unappealing for the design.
I hope this helps!
Upvotes: 0