Jamie

Reputation: 2301

robots.txt disallow subdirectory without showing its name to robots

I'm stuck on a problem with robots.txt.

I want to disallow http://example.com/forbidden and allow any other subdirectory of http://example.com. Normally the syntax for this would be:

User-agent: *
Disallow: /forbidden/

However, I don't want malicious robots to be able to see that the /forbidden/ directory exists at all. Nothing on the site links to it, and I want it to be completely hidden from everybody except those who already know it's there.

Is there a way to accomplish this? My first thought was to place a robots.txt in the subdirectory itself, but this has no effect, since crawlers only read the robots.txt at the root of the host. If I don't want my subdirectory to be indexed by either benign or malicious robots, am I safer listing it in robots.txt, or not listing or linking to it at all?

Upvotes: 1

Views: 684

Answers (2)

unor

Reputation: 96737

Even if you don’t link to it, crawlers may find the URLs anyhow:

  • someone else could link to it
  • some browser toolbars fetch all visited URLs and send them to search engines
  • your URLs could appear in (public) Referer logs of linked pages
  • etc.

So you should block them. There are two variants (if you don’t want to use access control):

  • robots.txt
  • meta-robots

(both variants only work for polite bots, of course)

You could use robots.txt without using the full folder name:

User-agent: *
Disallow: /fo

This would block all URLs whose path starts with /fo (so /forbidden/, but also /foo.html or /fonts/). Of course you would have to find a prefix that doesn't match any other URLs you still want to be indexed.
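To check which paths a given prefix actually catches, here is a minimal sketch using Python's standard urllib.robotparser. The example paths (/food/menu.html, /about/) and the helper name is_allowed are made up for illustration; only /forbidden/ and the /fo rule come from the question and answer.

```python
from urllib.robotparser import RobotFileParser

# The shortened rule from above, as it would appear in /robots.txt.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /fo",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse in-memory lines instead of fetching a URL


def is_allowed(path, agent="*"):
    """Return True if a polite crawler may fetch http://example.com<path>."""
    return parser.can_fetch(agent, "http://example.com" + path)


# Hypothetical paths, chosen only to show the prefix matching.
for path in ["/forbidden/secret.html", "/food/menu.html", "/about/"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Running this kind of check against your real URL list before deploying the rule is a cheap way to spot pages you would block by accident.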

However, if a crawler finds a blocked page somehow (see above), it may still add the URL to its index. robots.txt only disallows crawling/indexing of the page content, but using/adding/linking the URL is not forbidden.

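With meta-robots, however, you can even forbid indexing the URL. A crawler only sees this element if it is allowed to fetch the page, so the page must not also be disallowed in robots.txt. Add this element to the head of the pages you want to block: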

<meta name="robots" content="noindex">

For files other than HTML there is the HTTP header X-Robots-Tag.
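As a rough illustration of the header variant, here is a small WSGI application that adds X-Robots-Tag: noindex to every response under a path prefix. This is only a sketch: the names HIDDEN_PREFIX and app, and the placeholder response body, are assumptions, and on a real site you would more likely set the header in the web server configuration.

```python
# Assumed prefix, taken from the question's example directory.
HIDDEN_PREFIX = "/forbidden/"


def app(environ, start_response):
    """Toy WSGI app: tags responses under HIDDEN_PREFIX as noindex."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith(HIDDEN_PREFIX):
        # Ask polite crawlers not to index this URL, whatever the file type.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"placeholder body\n"]


# To try it locally (port 8000 is arbitrary):
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8000, app).serve_forever()
```

As with the meta element, this only affects polite bots that are allowed to fetch the URL in the first place.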

Upvotes: 3

Aaron Miller

Reputation: 3780

You're better off not listing it in robots.txt at all. That file is purely advisory: well-behaved robots will abide by the requests it makes, while rude or malicious ones may well use it as a list of potentially interesting targets. If your site contains no links to the /forbidden/ directory, then the only robot that will find it is one that carries out the equivalent of a dictionary attack, and that can be addressed with fail2ban or a similar log trawler. This being the case, including the directory in robots.txt will at best bring no additional benefit, and at worst clue an attacker in to the existence of something they might otherwise not have found.

Upvotes: 1
