Reputation: 907
Strange case: the Yandex bot is massively overloading my website. The problem is obviously mine, since I have some AJAX filters active on the site that are quite heavy when they are all called at once, the way bots do.
I have tried many robots.txt variations, but they have had no effect. The URLs that need to be blocked have the following form:
/de/component/customfilters/0-zu-15-eur/nKein+Herstellerf.html?custom_f_116[0]=38&custom_f_116[1]=4c&custom_f_116[2]=39&start=100
But these are rewritten URLs, not physical ones. The physical folder is already blocked in robots.txt.
How can I solve this, and how can I check whether the Yandex bot is reading the robots.txt file?
Do I have to restart Apache every time I edit the robots.txt file? I assume not, as with .htaccess.
Upvotes: 0
Views: 3775
Reputation: 790
If your website is currently under heavy load from this crawler, making the appropriate changes to your robots.txt may not actually help right now. The lovely people of the Yandex dev team do claim that their bots visit robots.txt before they crawl - but I suspect that once a crawl has started, a bot won't re-read robots.txt until the next time it wants to crawl. They may also have a cached copy of your robots.txt from before you changed it. You can look in your server logs to see whether they've fetched robots.txt since you changed it. My guess is probably not.
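For example, with a typical Debian/Ubuntu log location (the path is an assumption - adjust it for your setup), something like this will show whether Yandex has fetched robots.txt:
# Show recent fetches of robots.txt by a Yandex user agent
# (on RHEL/CentOS the log is often /var/log/httpd/access_log instead)
grep "robots.txt" /var/log/apache2/access.log | grep -i "yandex"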
There's also the possibility that a bad bot is pretending to be Yandex while crawling your site. Bad bots usually ignore the robots.txt rules anyway. So any changes you make may affect Yandex correctly, but not the bad bots.
In either case, if this crawler is putting your server under heavy load now, then you'll want to block them now and decide later if you want to make that a temporary or permanent block.
One way to do this is to use the BrowserMatchNoCase directive in .htaccess:
BrowserMatchNoCase "Yandex" bots
Order Allow,Deny
Allow from ALL
Deny from env=bots
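Note that Order/Allow/Deny is Apache 2.2 syntax. On Apache 2.4 the equivalent would be something like this (a sketch, assuming the default mod_authz_core setup):
BrowserMatchNoCase "Yandex" bots
<RequireAll>
    # Allow everyone except requests tagged as bots
    Require all granted
    Require not env=bots
</RequireAll>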
Or you could use a rewrite rule in .htaccess instead:
RewriteEngine On
# Match any request whose User-Agent contains "Yandex" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
# Return 403 Forbidden and stop processing further rules
RewriteRule .* - [F,L]
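If you would rather signal that the block is temporary, you can return 503 Service Unavailable instead of 403; well-behaved crawlers treat 5xx responses as a cue to back off and retry later. A sketch of that variant:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
# 503 tells the crawler this is a temporary condition
RewriteRule .* - [R=503,L]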
It doesn't matter whether the URL is being rewritten or not; the bot will crawl any URL it finds unless you disallow that URL. If you're disallowing the physical folder and the URLs don't point to that folder, the Disallow won't match them.
Try something like this in your robots.txt (the Disallow needs to sit inside a User-agent group to take effect):
User-agent: *
Disallow: /de/component/customfilters/
This asks all bots not to crawl any URL whose path starts with /de/component/customfilters/.
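Yandex also supports the * wildcard in robots.txt, so if you would rather block only the filtered URLs than the whole folder, something like this might work (a sketch, using the parameter name from your example URL):
User-agent: Yandex
# Block any URL under the filters folder that carries a custom_f_116 parameter
Disallow: /de/component/customfilters/*custom_f_116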
If you only want to talk to Yandex bots, you can specify that, too:
User-agent: Yandex # directives after this line will only apply to Yandex bots.
Disallow: /de/component/customfilters/
If you want to check that Yandex are reading your robots.txt, they have a test tool here:
http://webmaster.yandex.ru/robots.xml (the page is in Russian)
If you just want Yandex to slow down, you can add a crawl delay directive for Yandex bots:
User-agent: Yandex # directives after this line will only apply to Yandex bots.
Crawl-delay: 2 # specifies a delay of 2 seconds
More information: https://help.yandex.com/webmaster/controlling-robot/robots-txt.xml#crawl-delay
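Putting those together, a minimal robots.txt for your case might look like this (a sketch; the path comes from your example URL, and you may want a longer delay):
User-agent: Yandex
Disallow: /de/component/customfilters/
Crawl-delay: 2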
Upvotes: 3