Reputation: 28364
I have dev.example.com and www.example.com hosted on different subdomains. I want crawlers to drop all records of the dev subdomain but keep them on www. I am using git to store the code for both, so ideally I'd like both sites to use the same robots.txt file.
Is it possible to use one robots.txt file and have it exclude crawlers from the dev subdomain?
Upvotes: 14
Views: 24052
Reputation: 15
Instead of using robots.txt to handle the exclusion of the subdomain, you could use nginx. Although the methods above are sufficient, if you want to ensure that search engines never access your subdomains, you can configure nginx to return a 404 or 403 status code, either for all user agents or for specific ones such as Googlebot. Here's an example configuration that returns a 403 Forbidden response:
server {
    server_name subdomain.website.com;

    location / {
        # Return 403 Forbidden to the listed crawlers; all other clients are unaffected.
        if ($http_user_agent ~* (Googlebot|Bingbot|Slurp)) {
            return 403;
        }
    }
}
To exclude subdomains you can also use HTTP headers with nginx. Send the X-Robots-Tag HTTP header with a value of noindex, nofollow in your responses to ensure that pages are not indexed.
Here's an example of how to add the X-Robots-Tag header in your nginx configuration:
server {
    server_name subdomain.website.com;

    location / {
        # Tell crawlers not to index pages or follow links on this subdomain.
        add_header X-Robots-Tag "noindex, nofollow";
    }
}
Upvotes: 0
Reputation: 10071
I want Google to drop all of the records of the dev subdomain but keep the www.
If the dev site has already been indexed, return a 404 or 410 status code to crawlers to delist the content.
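As a minimal sketch of that, assuming the dev site is served by nginx (the user-agent list is just illustrative), you could return 410 Gone to crawlers on the dev host:
server {
    server_name dev.example.com;

    location / {
        # Crawlers get 410 Gone (swap in 404 if preferred), which prompts delisting;
        # ordinary visitors to the dev site are unaffected.
        if ($http_user_agent ~* (Googlebot|Bingbot|Slurp)) {
            return 410;
        }
    }
}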
Is it possible to have one robots.txt file that excludes a subdomain?
If your code is completely static, what you're looking for is the non-standard Host directive:
User-agent: *
Host: www.example.com
But if you can support a templating language, it's possible to keep everything in a single file:
User-agent: *
# If the ENVIRONMENT variable is not "production", all robots are disallowed.
{{ if eq (getenv "ENVIRONMENT") "production" }}
Disallow: /admin/
Disallow:
{{ else }}
Disallow: /
{{ end }}
Upvotes: 0
Reputation: 31
Keep in mind that if you block Google from crawling the pages under the subdomain, they won't (usually) drop out of the Google index immediately; blocking merely stops Google from re-crawling those pages.
If the dev subdomain isn't launched yet, make sure it has its own robots.txt disallowing everything.
However, if the dev subdomain already has pages indexed, you need to use the robots noindex meta tag first (which requires Google to crawl the pages again to read this directive), and then set up the robots.txt file for the dev subdomain once the pages have dropped out of the Google index (setting up a Google Webmaster Tools account helps you keep track of this).
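As a rough sketch of that two-phase approach, assuming the dev site is served by nginx: the X-Robots-Tag response header carries the same noindex signal as the meta tag, without editing every page, and the blocking robots.txt only comes afterwards:
server {
    server_name dev.example.com;

    location / {
        # Phase 1: leave crawling open, but mark every response as noindex.
        add_header X-Robots-Tag "noindex" always;
    }

    # Phase 2: only once the pages have dropped out of the index,
    # switch the dev robots.txt to "Disallow: /" so crawling stops too.
}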
Upvotes: 3
Reputation: 18127
You could use Apache rewrite logic to serve a different robots.txt on the development domain:
<IfModule mod_rewrite.c>
    RewriteEngine on
    # On the dev host, serve robots-dev.txt whenever /robots.txt is requested.
    RewriteCond %{HTTP_HOST} ^dev\.example\.com$
    RewriteRule ^robots\.txt$ robots-dev.txt
</IfModule>
And then create a separate robots-dev.txt:
User-agent: *
Disallow: /
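If the development site happens to run on nginx rather than Apache, a comparable sketch would be (the /var/www/dev path is hypothetical; adjust it to wherever robots-dev.txt lives):
server {
    server_name dev.example.com;

    # Serve the blocking robots-dev.txt whenever /robots.txt is requested on the dev host.
    location = /robots.txt {
        alias /var/www/dev/robots-dev.txt;
    }
}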
Upvotes: 29
Reputation: 1132
Sorry, this is most likely not possible. The general rule is that each subdomain is treated separately, so each would need its own robots.txt file.
Subdomains are often implemented as subfolders, with URL rewriting doing the mapping; in that setup you can share a single robots.txt file across subdomains. Here's a good discussion of how to do this: http://www.webmasterworld.com/apache/4253501.htm.
However, in your case you want different behavior for each subdomain, which is going to require separate files.
Upvotes: 5