Michael

Reputation: 33297

Exclude specific Folders from being crawled?

I want to exclude my user folders from being crawled by a search spider.

The structure is as follows. User accounts are under

www.mydomain.com/username

The problem is that I cannot exclude "/" in the disallowed part of my robots.txt because there are also other folders like

 www.mydomain.com/legal
 www.mydomain.com/privacy

There are also items that a user can generate which should be crawlable. They are under

 www.mydomain.com/username/items/itemId

How do I have to set up my robots.txt for that scenario?

Upvotes: 0

Views: 664

Answers (2)

plasticinsect

Reputation: 1752

If at all possible, you should follow taxicala's suggestion to change your directory structure.

If you absolutely cannot change your directory structure, you could use the allow directive and wildcards to deal with both problems:

User-agent: *
Allow: /legal$
Allow: /privacy$
Allow: /*/items/
Disallow: /

Just be aware that not all robots support this syntax. This will definitely work for all major search engines, but it may not work for some older robots. Also, this is not particularly future-proof. If you later add some new top-level pages and you forget to add them to the robots.txt file, they will be silently blocked. The ideal approach is to use a directory structure that isolates the things you want blocked from the things you don't.
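To see how a wildcard-aware crawler would interpret these rules, here is a minimal sketch of the longest-match logic that the major search engines document (Allow wins ties). This is my own illustration of the matching algorithm, not a real crawler's API; the standard-library robots.txt parsers in most languages do not handle `*` and `$` wildcards, which is exactly the portability caveat above.

```python
import re

def rule_to_regex(path):
    """Compile a robots.txt path pattern ('*' wildcard, optional trailing '$') to a regex."""
    pattern = re.escape(path).replace(r'\*', '.*')
    if pattern.endswith(r'\$'):
        pattern = pattern[:-2] + '$'  # trailing '$' anchors the end of the URL path
    return re.compile('^' + pattern)

def is_allowed(rules, url_path):
    """rules: list of (is_allow, path) pairs. Longest matching pattern wins; Allow wins ties."""
    best = None  # (pattern_length, is_allow) of the best match so far
    for is_allow, path in rules:
        if rule_to_regex(path).match(url_path):
            if best is None or len(path) > best[0] or (len(path) == best[0] and is_allow):
                best = (len(path), is_allow)
    return True if best is None else best[1]  # no matching rule means allowed

# The rules from the robots.txt above:
rules = [(True, '/legal$'), (True, '/privacy$'), (True, '/*/items/'), (False, '/')]

print(is_allowed(rules, '/legal'))             # True:  '/legal$' is longer than '/'
print(is_allowed(rules, '/someuser'))          # False: only '/' matches
print(is_allowed(rules, '/someuser/items/7'))  # True:  '/*/items/' outranks '/'
print(is_allowed(rules, '/legalstuff'))        # False: '$' stops '/legal$' from matching
```

Note how the trailing `$` is what keeps `/legalstuff` blocked while `/legal` stays crawlable.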

Upvotes: 1

taxicala

Reputation: 21759

Check the following answered question; it may solve yours:

Robots.txt Disallow Certain Folder Names

Hope this helps.

EDIT

See the following answered question, which shows how to exclude a folder but not its children:

Robots.txt Allow sub folder but not the parent

You should also consider using a structure like the following:

mydomain.com/users/user1/subfolder
mydomain.com/users/user2/subfolder

so that you can target your rules more accurately.
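With a structure like that (assuming all user directories live under /users/ and item pages keep an /items/ path segment), the robots.txt becomes a simple sketch along these lines, and new top-level pages such as /legal stay crawlable by default:

User-agent: *
Allow: /users/*/items/
Disallow: /users/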

Upvotes: 1
