Reputation: 2301
I'm stuck on a problem with robots.txt. I want to disallow http://example.com/forbidden and allow any other subdirectory of http://example.com. Normally the syntax for this would be:
User-agent: *
Disallow: /forbidden/
However, I don't want malicious robots to be able to see that the /forbidden/ directory exists at all. Nothing on the site links to it, and I want it to be completely hidden from everybody except those who already know it's there.
Is there a way to accomplish this? My first thought was to place a robots.txt in the subdirectory itself, but this will have no effect, since crawlers only request robots.txt from the site root. If I don't want my subdirectory to be indexed by either benign or malicious robots, am I safer listing it in robots.txt, or not listing or linking to it at all?
Upvotes: 1
Views: 684
Reputation: 96737
Even if you don’t link to it, crawlers may find the URLs anyhow.
So you should block them. There are two variants (if you don’t want to use access control): robots.txt and meta-robots. Both variants only work for polite bots, of course.
You could use robots.txt without using the full folder name:
User-agent: *
Disallow: /fo
This would block all URLs whose path starts with /fo. Of course, you would have to find a string that doesn’t also match other URLs you still want to be indexed.
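To see which paths a short prefix like this catches, here is a minimal sketch using Python's standard urllib.robotparser. The three URLs and the helper name allowed are made up for illustration, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The two rules from above, parsed as if they had been fetched from /robots.txt.
rules = [
    "User-agent: *",
    "Disallow: /fo",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(url: str) -> bool:
    """Ask the parser whether a generic polite bot ('*') may fetch this URL."""
    return parser.can_fetch("*", url)


# Illustrative URLs (assumptions, not taken from a real site).
for url in (
    "http://example.com/forbidden/secret.html",
    "http://example.com/forum/",
    "http://example.com/about/",
):
    print(url, "->", "allowed" if allowed(url) else "blocked")
```

The /forum/ URL is blocked as well, which is exactly the collateral matching you have to watch out for when picking the prefix.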
However, if a crawler finds a blocked page somehow (see above), it may still add the URL to its index. robots.txt only disallows crawling/indexing of the page content, but using/adding/linking the URL is not forbidden.
With meta-robots, however, you can even forbid indexing the URL, provided the crawler is allowed to fetch the page and can therefore see the element. Add this element to the head of the pages you want to block:
<meta name="robots" content="noindex">
For files other than HTML there is the HTTP header X-Robots-Tag.
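How you send that header depends on your server. As a rough sketch only, this is what it could look like with Python's built-in http.server; the /forbidden/ prefix, the helper needs_noindex and the class NoIndexHandler are assumptions for illustration, and on a real site you would more likely set the header in the web server configuration.

```python
from http.server import SimpleHTTPRequestHandler

# Assumption: everything under this path prefix should stay out of indexes.
HIDDEN_PREFIX = "/forbidden/"


def needs_noindex(path: str) -> bool:
    """Return True if a response for this request path should carry the header."""
    return path.startswith(HIDDEN_PREFIX)


class NoIndexHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory and adds X-Robots-Tag where needed."""

    def end_headers(self):
        # self.path may be missing if the request line could not be parsed.
        if needs_noindex(getattr(self, "path", "")):
            self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()


# To try it locally (blocks until interrupted):
#   from http.server import ThreadingHTTPServer
#   ThreadingHTTPServer(("127.0.0.1", 8000), NoIndexHandler).serve_forever()
```

Like the meta element, this only helps with polite bots that are allowed to request the file in the first place.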
Upvotes: 3
Reputation: 3780
You're better off not listing it in robots.txt at all. That file is purely advisory: well-behaved robots will abide by the requests it makes, while rude or malicious ones may well use it as a list of potentially interesting targets. If your site contains no links to the /forbidden/ directory, then no robot will find it unless it carries out the equivalent of a dictionary attack, which can be addressed by fail2ban or a similar log trawler. This being the case, including the directory in robots.txt will at best have no additional benefit, and at worst clue in an attacker to the existence of something they might otherwise not have found.
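To give an idea of what such a log trawler looks for, here is a small sketch that counts 404 responses per client in an access log and flags clients that guess too many nonexistent paths. It is not how fail2ban itself is implemented; the Common Log Format layout, the function name suspicious_clients and the threshold are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format): IP, identity, user, [time], "request", status, size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')


def suspicious_clients(lines, max_not_found=3):
    """Return the set of client IPs with more than max_not_found 404 responses."""
    misses = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "404":
            misses[match.group(1)] += 1
    return {ip for ip, count in misses.items() if count > max_not_found}
```

A real setup would then ban the flagged addresses at the firewall for a while, which is the part fail2ban automates.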
Upvotes: 1