Reputation: 33297
I want to exclude my user folders from being crawled by a search spider.
The structure is as follows. User accounts are under
www.mydomain.com/username
The problem is that I cannot exclude "/" in the disallowed part of my robots.txt because there are also other folders like
www.mydomain.com/legal
www.mydomain.com/privacy
There are also items that a user can generate which should be crawlable. They are under
www.mydomain.com/username/items/itemId
How do I have to set up my robots.txt for this scenario?
Upvotes: 0
Views: 664
Reputation: 1752
If at all possible, you should follow taxicala's suggestion to change your directory structure.
If you absolutely cannot change your directory structure, you could use the allow directive and wildcards to deal with both problems:
User-agent: *
Allow: /legal$
Allow: /privacy$
Allow: /*/items/
Disallow: /
Just be aware that not all robots support this syntax. It will definitely work for all major search engines, but it may not work for some older robots. It is also not particularly future-proof: if you later add new top-level pages and forget to add them to the robots.txt file, they will be silently blocked. The ideal approach is to use a directory structure that isolates the things you want blocked from the things you don't.
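To see why those rules work, here is a minimal sketch of the "longest match wins, Allow beats Disallow on a tie" matching that major search engines apply to `*` and `$` patterns. This is only an illustration of the matching logic, not a full robots.txt parser (note that, for example, Python's built-in `urllib.robotparser` follows the original spec and does not handle these wildcards):

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, '$' end-anchor)
    into an anchored regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + body + ("$" if anchored else ""))

# The rules from the answer above.
RULES = [
    ("Allow", "/legal$"),
    ("Allow", "/privacy$"),
    ("Allow", "/*/items/"),
    ("Disallow", "/"),
]

def is_allowed(path, rules=RULES):
    """Return True if the most specific (longest) matching rule allows path.
    On a tie in length, Allow wins. No match at all means allowed."""
    best_len, best_allow = -1, True
    for directive, pat in rules:
        if pattern_to_regex(pat).match(path):
            allow = directive == "Allow"
            if len(pat) > best_len or (len(pat) == best_len and allow):
                best_len, best_allow = len(pat), allow
    return best_allow
```

With these rules, `/legal` and `/johndoe/items/123` come out allowed, while `/johndoe` is blocked by the catch-all `Disallow: /`.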
Upvotes: 1
Reputation: 21759
Check the following answered question; it may solve yours:
Robots.txt Disallow Certain Folder Names
Hope this helps.
See the following answered question on how to exclude a folder but not its children:
Robots.txt Allow sub folder but not the parent
You should also consider using a structure like the following:
mydomain.com/users/user1/subfolder
mydomain.com/users/user2/subfolder
in order to target your rules more accurately.
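With a common `/users/` prefix like that, the rules become much simpler. Assuming user-generated items live under `/users/username/items/` (a hypothetical layout based on the paths above), a sketch could be:

```
User-agent: *
Disallow: /users/
Allow: /users/*/items/
```

Top-level pages like `/legal` and `/privacy` then need no special handling at all, though the `Allow` line still relies on wildcard support, which not every robot implements.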
Upvotes: 1