Reputation: 11917
I read a little bit about robots.txt, and I read that I should disallow all folders in my web application, but I would like to allow bots to read the main page and one view (the URL is, for example, www.mywebapp/searchresults; it is a CodeIgniter route, called from a controller function in application/controllers).
The folder structure, for example, is:
- index.php (should be readable by bots)
- application
  - controllers
    - controller (contains the function which loads the view)
  - views
- public
Should I create a robots.txt like this:
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /application/controllers/function
or, using routes, something like this:
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /www.mywebapp/searchresults
or maybe one using views?
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /application/views/search/index.php
Thanks!
Upvotes: 1
Views: 13835
Reputation: 11917
An answer to my own old question:
When we want to allow bots to read some page, we need to use our URL as it appears in the routing (not the file path on disk), so in this case:
Allow: /searchresults
Note that paths in robots.txt are relative to the host root, so the domain name itself is not part of the rule.
In some cases we can also disallow pages with an HTML meta tag (added to the <HEAD> section):
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
When we want to block some folder, e.g. one with pictures, just do:
Disallow: /public/images
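Putting these pieces together, a minimal robots.txt for the layout in the question might look like this (a sketch, assuming the app lives at the host root and /searchresults is the routed URL from above):
User-agent: *
Disallow: /application/
Disallow: /public/
Allow: /searchresults
The Allow line is technically redundant, since /searchresults is not covered by either Disallow, but it makes the intent explicit.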
Upvotes: 1
Reputation: 139
You don't block the view file, as that isn't directly accessible to the crawlers. You need to block (or allow) the URL that is used to access your view; in this case that is /searchresults, not /application/views/search/index.php.
The robots.txt file MUST be placed in the document root of the host. It won’t work in other locations.
If your host is www.example.com, it needs to be accessible at http://www.example.com/robots.txt
To remove directories or individual pages of your website, you can place a robots.txt file at the root of your server. When creating your robots.txt file, please keep the following in mind: when deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot". If no such entry exists, it will obey the first entry with a User-agent of "*". Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name.
To remove all pages under a particular directory (for example, listings), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /listings
To remove all files of a specific file type (for example, .gif), you'd use the following robots.txt entry:
User-agent: Googlebot
Disallow: /*.gif$
To remove dynamically generated pages, you'd use this robots.txt entry:
User-agent: Googlebot
Disallow: /*?
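Applied to the CodeIgniter app from the question, the same wildcard syntax can be combined with the earlier rules; a sketch (assuming the routed URL /searchresults, and that query-string URLs should stay out of the index):
User-agent: *
Disallow: /application/
Disallow: /public/
Disallow: /*?
Allow: /searchresults
Be aware that /*? would also block query-string variants such as /searchresults?page=2, so include that rule only if those variants really should not be crawled.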
Option 2: Meta tags
Another standard, which can be more convenient for page-by-page use, involves adding a <META> tag to an HTML page to tell robots not to index the page. This standard is described at http://www.robotstxt.org/wc/exclusion.html#meta.
To prevent all robots from indexing a page on your site, you'd place the following meta tag into the <HEAD> section of your page:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
To allow other robots to index the page on your site, preventing only Google's robots from indexing the page, you'd use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
To allow robots to index the page on your site but instruct them not to follow outgoing links, you'd use the following tag:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
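For placement, here is a minimal sketch of a full page (the title and body content are just illustrative placeholders):
<!DOCTYPE html>
<html>
<head>
<title>Search results</title>
<!-- keep this page out of the index and do not follow its links -->
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
<body>
<p>Page content here.</p>
</body>
</html>
The tag must appear inside the <HEAD> section, and crawlers only read it when they fetch the page, so the page itself must not be blocked by robots.txt or the tag will never be seen.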
For further reference:
https://www.elegantthemes.com/blog/tips-tricks/how-to-create-and-configure-your-robots-txt-file
Upvotes: 0