Reputation: 22275

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.

But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see its contents, like so:

# robots.txt for Sites
# Do Not delete this file.

User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/

That tells anyone exactly which pages I don't want visited.

Any idea how to avoid this?

PS. Here's an example of a page that I don't want exposed to the public: the PayPal validation page for my software license payment. The page logic will not let a dud request through, but each attempt wastes bandwidth (for the PayPal connection as well as for validation on my server) and logs a connection-attempt entry in the database.

PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something simple that a crawler can just guess.

Upvotes: 0

Answers (5)

BadgerAndK

Reputation: 44

This supplements David Underwood's and unor's answers (I don't have enough reputation to comment, so I'm posting it as an answer). From some recent digging, it appears that Google has a clause that allows it to ignore a robots.txt file it previously respected, on top of the other security concerns. The link below is a blog post by Zac Gery explaining the new(er) policy, with some simple explanations of how to "force" Google's search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it very useful.

http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Upvotes: 1

unor

Reputation: 96607

You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).

Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.

So to disallow /php/ipnius726.php, you could use the following robots.txt:

User-agent: *
Disallow: /php/ipn

This will block all URLs whose path starts with /php/ipn, for example:

http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
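
The prefix matching can be checked locally. This is a minimal sketch using Python's standard urllib.robotparser, which applies the same "path starts with" rule; the extra URLs /php/other.php and /php/ip are made up here to show what stays allowed.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from this answer, fed in as lines (no network access needed).
ROBOTS_TXT = """\
User-agent: *
Disallow: /php/ipn
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch this path on example.com."""
    return parser.can_fetch("*", "http://example.com" + path)


blocked_paths = [
    "/php/ipn",
    "/php/ipn.html",
    "/php/ipn/",
    "/php/ipn/foo",
    "/php/ipnfoobar",
    "/php/ipnius726.php",
]

for path in blocked_paths + ["/php/other.php", "/php/ip"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Note that this only shows how a well-behaved parser reads the file; it says nothing about clients that ignore robots.txt.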

Upvotes: 1

David Underwood

Reputation: 4966

URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.

What you can do is:

1. Secure access to the functionality behind those URLs
2. Ask people nicely not to visit them

There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
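
The session-token idea could look roughly like the sketch below. It is written in Python for illustration rather than in the asker's PHP, and everything in it (SECRET_KEY, issue_token, is_authorized, the token format) is a hypothetical choice, not a prescribed scheme: the server signs a user ID with a secret and later rejects any request whose token does not verify.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-server secret; a real deployment would load it from config.
SECRET_KEY = secrets.token_bytes(32)


def _sign(user_id: str) -> str:
    """HMAC-SHA256 signature of the user ID, as a hex string."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(user_id: str) -> str:
    """Create a token of the (assumed) form '<user_id>:<signature>'."""
    return f"{user_id}:{_sign(user_id)}"


def is_authorized(token: str) -> bool:
    """Check that the token's signature matches its user ID."""
    user_id, sep, signature = token.rpartition(":")
    if not sep:
        return False  # no ':' separator, so the token is malformed
    expected = _sign(user_id)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))
```

A request handler would call is_authorized before doing any expensive work, so a request without a valid token costs almost nothing.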

Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.

That's really all you can do.

Upvotes: 3

Wyzard

Reputation: 34563

If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.

Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
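
As a rough sketch of the honeypot idea, the snippet below scans access-log lines for requests to the decoy path and collects the client IPs. It assumes logs in the common Apache/nginx "combined" layout; the regular expression, the HONEYPOT_PATH constant, and the sample log lines are illustrative assumptions.

```python
import re

HONEYPOT_PATH = "/honeypot.php"  # the decoy listed in robots.txt

# Assumed log layout: IP, two fields, [timestamp], "METHOD PATH PROTOCOL" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)


def honeypot_clients(log_lines):
    """Return the set of client IPs that requested the honeypot path."""
    offenders = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        if path == HONEYPOT_PATH:
            offenders.add(match.group("ip"))
    return offenders


sample_log = [
    '203.0.113.7 - - [10/Oct/2013:13:55:36 +0000] "GET /honeypot.php HTTP/1.1" 404 512',
    '198.51.100.2 - - [10/Oct/2013:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '203.0.113.9 - - [10/Oct/2013:13:57:12 +0000] "GET /honeypot.php?x=1 HTTP/1.1" 404 512',
]
print(honeypot_clients(sample_log))
```

The resulting set could then feed whatever blocking mechanism the server already uses.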

Upvotes: 1

Sneftel

Reputation: 41474

You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but list only hiddenfolder in your robots.txt.

Upvotes: 1
