Reputation: 121
My Google App Engine site is being crawled by a lot of bots, and it has gotten much worse recently. The number of bots has skyrocketed, most of them don't check robots.txt, and this is costing me. Is there a way to prevent bad bots that ignore robots.txt from launching App Engine?
Upvotes: 1
Views: 588
Reputation: 39824
Unfortunately not. robots.txt is only effective for well-behaved bots that properly implement and respect its conventions. From "How do I prevent robots scanning my site?":
The quick way to prevent robots visiting your site is put these two lines into the /robots.txt file on your server:
User-agent: *
Disallow: /
but this only helps with well-behaved robots.
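To make the "well-behaved" caveat concrete, here is a minimal sketch of how a compliant crawler would interpret those two lines, using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the bot names, and the example.com URL are made up for illustration. A bad bot simply never performs this check, which is why the file cannot stop it.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt quoted above: every user agent, every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch `url`.

    `is_allowed` is a hypothetical helper, not part of any crawler.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant crawler is told to stay away from every path.
    print(is_allowed("AnyBot", "https://example.com/page"))

    # A section that targets one named (hypothetical) bot only.
    only_badbot = "User-agent: BadBot\nDisallow: /\n"
    print(is_allowed("BadBot", "https://example.com/page", only_badbot))
    print(is_allowed("GoodBot", "https://example.com/page", only_badbot))
```

The check runs entirely on the crawler's side, so the server has no way to enforce it.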
And from the quoted link:
Can I block just bad robots?
In theory yes, in practice, no. If the bad robot obeys /robots.txt, and you know the name it scans for in the User-Agent field, then you can create a section in your /robots.txt to exclude it specifically. But almost all bad robots ignore /robots.txt, making that pointless.
If the bad robot operates from a single IP address, you can block its access to your web server through server configuration or with a network firewall.
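As a rough illustration of blocking by address, the sketch below checks a client IP against a blocklist inside a WSGI application. Everything named here (`BLOCKED_NETWORKS`, `is_blocked`, `BlockListMiddleware`) is hypothetical, and the addresses come from the ranges reserved for documentation. It also assumes `REMOTE_ADDR` holds the real client address, which may not be true behind a proxy. Note that a check inside the application only runs after the request has already reached it; a firewall in front of the server is what keeps the request from arriving at all.

```python
import ipaddress

# Hypothetical blocklist using documentation-only address ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.7/32"),  # a single address
    ipaddress.ip_network("203.0.113.0/24"),   # a whole range
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if `remote_addr` falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparsable input is not on the list; handle it elsewhere if needed.
        return False
    return any(addr in network for network in BLOCKED_NETWORKS)


class BlockListMiddleware:
    """WSGI middleware that answers 403 for blocked client addresses."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Assumption: REMOTE_ADDR is the client's address, not a proxy's.
        if is_blocked(environ.get("REMOTE_ADDR", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```

Wrapping an existing WSGI callable would look like `app = BlockListMiddleware(app)`.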
If copies of the robot operate at lots of different IP addresses, such as hijacked PCs that are part of a large botnet, then it becomes more difficult. The best option then is to use advanced firewall rule configuration that automatically blocks access to IP addresses that make many connections; but that can hit good robots as well as your bad robots.
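The "block addresses that make many connections" idea can be sketched as a sliding-window counter per IP. This is only an illustration of the logic, not a firewall: the class name `ConnectionRateLimiter` and the thresholds are hypothetical, the state lives in one process's memory, and, as the quote warns, a busy legitimate crawler would be throttled just like a bad one.

```python
import time
from collections import defaultdict, deque


class ConnectionRateLimiter:
    """Allow at most `max_requests` per IP within `window_seconds`."""

    def __init__(self, max_requests=100, window_seconds=60.0, clock=time.monotonic):
        # Thresholds are illustrative; tune them against real traffic.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        """Record a request from `ip` and return False if it is over the limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = ConnectionRateLimiter(max_requests=3, window_seconds=10.0)
    for attempt in range(5):
        print(attempt, limiter.allow("203.0.113.9"))
```

Because each process keeps its own counters, a setup with several server instances would need shared storage for the counts, which this sketch does not attempt.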
Upvotes: 1