Reputation: 192
I have a Drupal site that is up and running. The site is not properly optimized for SEO, and a lot of duplicate content gets indexed in Google because of paths like /category, /taxonomy, etc.
The structure is:
/var/www/appname/ (contains a custom-built application)
/var/www/appname/drup (contains my Drupal installation)
I went through the results of a Google search for site:appname.com and saw that there is a lot of duplicated content because of /content, /taxonomy, /node, etc.
My robots.txt in /var/www/appname already contains the following, but I am surprised that these pages are still getting indexed. Please advise.
User-agent: *
Crawl-delay: 10
Allow: /
Allow: /drup/
# Directories
Disallow: /drup/includes/
Disallow: /drup/misc/
Disallow: /drup/modules/
Disallow: /drup/profiles/
Disallow: /drup/scripts/
Disallow: /drup/themes/
# Files
Disallow: /drup/CHANGELOG.txt
Disallow: /drup/cron.php
Disallow: /drup/INSTALL.mysql.txt
Disallow: /drup/INSTALL.pgsql.txt
Disallow: /drup/install.php
Disallow: /drup/INSTALL.txt
Disallow: /drup/LICENSE.txt
Disallow: /drup/MAINTAINERS.txt
Disallow: /drup/update.php
Disallow: /drup/UPGRADE.txt
Disallow: /drup/xmlrpc.php
# Paths (clean URLs)
Disallow: /drup/admin/
Disallow: /drup/comment/reply/
Disallow: /drup/contact/
Disallow: /drup/logout/
Disallow: /drup/node/add/
Disallow: /drup/search/
Disallow: /drup/user/register/
Disallow: /drup/user/password/
Disallow: /drup/user/login/
# Paths (no clean URLs)
Disallow: /drup/?q=admin/
Disallow: /drup/?q=comment/reply/
Disallow: /drup/?q=contact/
Disallow: /drup/?q=logout/
Disallow: /drup/?q=node/add/
Disallow: /drup/?q=search/
Disallow: /drup/?q=user/password/
Disallow: /drup/?q=user/register/
Disallow: /drup/?q=user/log
Upvotes: 0
Views: 556
Reputation: 750
Do you have the ability to verify ownership of the site with Google Webmaster Tools at:
http://www.google.com/webmasters/tools
If so, I'd recommend doing that, then trying "Fetch as Googlebot" under the "Diagnostics" category for that site. Your "Fetch Status" will indicate "Denied by robots.txt" if your robots.txt is working as expected.
Indexed pages can hang around for a while and keep showing up in Google search results after you've changed the robots.txt. But Fetch as Googlebot gives you a real-time indication of what happens when Googlebot comes knocking.
If the URLs that you don't want indexed are retrieved without a problem, then you'll need to focus on problems with the robots.txt itself: its location, its syntax, the paths listed, etc. I always suggest people retrieve it manually in the browser (at the root of their website) to double-check for obvious goofs.
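For example, a quick command-line check (substituting your real domain for appname.com):

curl -i http://appname.com/robots.txt

A 200 response containing the file you expect rules out location and retrieval problems; anything else tells you where to look.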
Upvotes: 0
Reputation: 1534
You can disallow the directories that are showing duplicate content. As you explained, /content, /taxonomy, and /node are showing duplicate content.
Add the following lines to the Directories section of your robots.txt file to restrict search engines' access to these directories.
Disallow: /drup/content/
Disallow: /drup/taxonomy/
Disallow: /drup/node/
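Since your robots.txt also has a "Paths (no clean URLs)" section, you would presumably want the matching entries there too, following the same pattern already in your file:

Disallow: /drup/?q=content/
Disallow: /drup/?q=taxonomy/
Disallow: /drup/?q=node/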
Upvotes: 0
Reputation: 5560
There are several modules that take care of SEO and duplicated content. I would first advise installing and going over http://drupal.org/project/seo_checklist. For duplicated content, you may check http://drupal.org/project/globalredirect.
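If you happen to use Drush, a quick way to get both (a sketch, using the project short names from the URLs above):

drush dl seo_checklist globalredirect
drush en -y seo_checklist globalredirect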
Anyway, /taxonomy and /content are just listing pages; rather than disallowing them, you may want to override their paths with some sort of custom content and let crawlers know what they are looking at.
Upvotes: 1
Reputation: 30875
You just need an XML sitemap that tells Google where all the pages are, rather than letting Google crawl it on its own.
In fact, when Stack Overflow was in beta, they tried to let the crawler work its magic. However, on highly dynamic sites it's almost impossible to get adequate results that way.
Thus, with an XML sitemap you tell Google where each page is, what its priority is, and how often it changes.
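A minimal entry in the standard sitemap format looks like this (the node URL is just an illustration; on Drupal, a module such as http://drupal.org/project/xmlsitemap can generate the whole file for you):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://appname.com/drup/node/1</loc>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>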
Upvotes: 1