Reputation: 911
I have a site written in PHP. In recent weeks, my website has been getting a lot of automated hits from a single location. This suggests that someone is "poaching" the content in an automated manner instead of visiting the site through a proper browser. I suppose this is being done with tools/utilities like Wget (or cURL or whatever).
Is there a way such automated access can be blocked?
In an attempt to investigate, I tried using Wget on popular sites like Yahoo, US News and Bloomberg. Wget successfully downloaded the pages (HTML code) from Yahoo and US News. However, a similar attempt on a sample Bloomberg page failed.
Command I used:
wget64.exe https://www.bloomberg.com/research//stocks/snapshot/snapshot_article.asp?ticker=CWEN
The resulting file that was saved contained the following:
<h2 class="main__heading">We've detected unusual activity from your computer network</h2>
<p class="continue">To continue, please click the box below to let us know you're not a robot.</p>
<div id="px-captcha"></div>
</section>
<section class="box">
<section class="info">
<h3 class="info__heading">Why did this happen?</h3>
<p class="info__text">Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information you can review our <a class="info__link" href="/notices/tos">Terms of Service</a> and <a class="info__link" href="/notices/tos">Cookie Policy</a>
This indicates that Bloomberg, at least, has a way to prevent such automated access. Does anyone know what a webmaster can implement to prevent such automated access (as Bloomberg has)?
While I agree that access on the internet should be free, sometimes a few boundaries need to be implemented to prevent unauthorized access.
Upvotes: 2
Views: 3290
Reputation: 2762
Wget can easily be detected and blocked by adding the following to your .htaccess file (this assumes mod_rewrite is enabled).
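# Rewrite rules are ignored unless the engine is switched on
RewriteEngine On
# Match any User-Agent containing "wget" (case-insensitive) and answer 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} wget.* [NC]
RewriteRule .* - [F,L]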
However, if the User Agent string is changed, then you may never know that it is Wget.
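To illustrate that limitation, here is a minimal Python sketch that mimics the case-insensitive match the RewriteCond above performs. The function name `is_blocked` and the sample User-Agent strings are hypothetical, chosen only for illustration. This is not how Apache evaluates the rule internally, just the same pattern applied the same way.

```python
import re

# Same pattern as the RewriteCond; re.IGNORECASE plays the role of the [NC] flag.
WGET_PATTERN = re.compile(r"wget.*", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains "wget" in any letter case.

    The search is unanchored, as a RewriteCond pattern is, so the match
    may occur anywhere in the header value.
    """
    return WGET_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    # Wget's default header starts with "Wget/", so it is caught.
    print(is_blocked("Wget/1.21.4"))
    # A browser-like string contains no "wget", so it passes through.
    print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

Since Wget lets the caller set this header with its `--user-agent` option, a client that sends a browser-like string would get past a check of this kind.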
You may also want to read up on how to block robots: http://www.robotstxt.org/
http://www.javascriptkit.com/howto/htaccess13.shtml
Upvotes: 3