Kirk Ouimet

Reputation: 28364

Disallow or Noindex on Subdomain with robots.txt

I have dev.example.com and www.example.com hosted on different subdomains. I want crawlers to drop all records of the dev subdomain but keep them on www. I am using git to store the code for both, so ideally I'd like both sites to use the same robots.txt file.

Is it possible to use one robots.txt file and have it exclude crawlers from the dev subdomain?

Upvotes: 14

Views: 24052

Answers (5)

NoobDev

Reputation: 15

Instead of using robots.txt to handle the exclusion of the subdomain, you could use nginx. Although the methods in the other answers are sufficient, if you want to ensure that search engines never access your subdomain, you can configure nginx to return a 404 or 403 status code for all user agents, or only for specific ones like Googlebot. Here's an example configuration that returns a 403 Forbidden response:

server {  
    server_name subdomain.website.com;  

    location / {  
        if ($http_user_agent ~* (Googlebot|Bingbot|Slurp)) {  
            return 403;  
        }  
    }  
}  

You can also exclude the subdomain by sending HTTP headers with nginx. Send the X-Robots-Tag header with a value of "noindex, nofollow" in your responses to ensure that the pages are not indexed.

Here's an example of how to add the X-Robots-Tag header in your nginx configuration:

server {
    server_name subdomain.website.com;

    location / {
        add_header X-Robots-Tag "noindex, nofollow";
    }
}

Upvotes: 0

vhs

Reputation: 10071

I want Google to drop all of the records of the dev subdomain but keep the www.

If the dev site has already been indexed, return a 404 or 410 error to crawlers to delist the content.
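For example, here is a minimal nginx sketch in the same spirit as the configuration in another answer, assuming the dev site is served by nginx and that only crawlers (not developers) should receive the 410; the host name and user-agent list are placeholders:

server {
    server_name dev.example.com;

    location / {
        # Send 410 Gone to common crawler user agents so already-indexed
        # pages get dropped; regular visitors still reach the dev site.
        if ($http_user_agent ~* (Googlebot|Bingbot|Slurp)) {
            return 410;
        }
    }
}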

Is it possible to have one robots.txt file that excludes a subdomain?

If your code is completely static, what you're looking for is the non-standard Host directive:

User-agent: *
Host: www.example.com

But if you can support a templating language, it's possible to keep everything in a single file:

User-agent: *
# If the ENVIRONMENT variable is not "production", all robots will be disallowed.
{{ if eq (getenv "ENVIRONMENT") "production" }}
  Disallow: /admin/
  Disallow:
{{ else }}
  Disallow: /
{{ end }}

Upvotes: 0

user3505611

Reputation: 31

Keep in mind that if you block Google from crawling the pages under the subdomain, pages that are already indexed won't (usually) drop out of the Google index immediately. Blocking merely stops Google from re-crawling and re-indexing those pages.

If the dev subdomain isn't launched yet, make sure it has its own robots.txt disallowing everything.

However, if the dev subdomain already has pages indexed, then you need to use the robots noindex meta tag first (which requires Google to crawl the pages again to read this request), and only set up the disallow-everything robots.txt for the dev subdomain once the pages have dropped out of the Google index (a Google Webmaster Tools account helps you track this).
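For reference, the page-level noindex request is just a meta tag in the head of each page on the dev subdomain (the X-Robots-Tag response header shown in another answer is the HTTP equivalent for non-HTML resources):

<!-- In the <head> of every page served from the dev subdomain -->
<meta name="robots" content="noindex">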

Upvotes: 3

Christian Davén

Reputation: 18127

You could use Apache rewrite logic to serve a different robots.txt on the development domain:

<IfModule mod_rewrite.c>
    RewriteEngine on
    RewriteCond %{HTTP_HOST} ^dev\.example\.com$
    RewriteRule ^robots\.txt$ robots-dev.txt
</IfModule>

And then create a separate robots-dev.txt:

User-agent: *
Disallow: /

Upvotes: 29

toddles2000

Reputation: 1132

Sorry, this is most likely not possible. The general rule is that each subdomain is treated separately, and thus each would need its own robots.txt file.

Subdomains are often implemented as subfolders with URL rewriting in place to do the mapping, in which case you can share a single robots.txt file across subdomains. Here's a good discussion of how to do this: http://www.webmasterworld.com/apache/4253501.htm.

However, in your case you want different behavior for each subdomain, which is going to require separate files.
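As a sketch of what the two separate files might look like for the domains in the question:

# robots.txt served at dev.example.com/robots.txt
User-agent: *
Disallow: /

# robots.txt served at www.example.com/robots.txt
User-agent: *
Disallow: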

Upvotes: 5
