Reputation: 179
I am trying to scrape several websites using the BeautifulSoup and mechanize libraries in Python. However, I came across a website with the following robots.txt:
User-Agent: *
Allow: /$
Disallow: /
According to Wikipedia, an Allow directive counteracts a following Disallow directive. I have read about simpler examples and understand how it works, but this situation is a bit confusing to me. Am I right to assume that my crawler is allowed to access everything on this website? If yes, it seems really strange that the website would even bother writing a robots.txt in the first place...
Extra information:
Mechanize gave me an error when I tried to scrape this website; it was something along the lines of "HTTP Error 403: crawling is prohibited because of robots.txt". If my assumption stated above is correct, then I think mechanize returned an error while trying to access the website because it is either not equipped to handle such a robots.txt or it follows a different standard for interpreting robots.txt files. (In which case I will just have to make my crawler ignore robots.txt.)
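As far as I can tell, Python's standard-library robots.txt parser (which mechanize's handling resembles) does not implement the $ end-of-URL extension and treats the character literally, so under this file it reports every path, including the homepage, as disallowed. A quick sanity check (example.com stands in for the real site):

```python
from urllib.robotparser import RobotFileParser

# Feed the site's robots.txt rules directly into the stdlib parser.
rp = RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Allow: /$",
    "Disallow: /",
])

# The stdlib parser treats '$' as a literal character rather than an
# end anchor, so "Allow: /$" never matches a real path and
# "Disallow: /" ends up blocking everything.
print(rp.can_fetch("*", "http://example.com/"))       # homepage
print(rp.can_fetch("*", "http://example.com/other"))  # any other page
```

Both calls come back False here, which would explain the 403 even on the homepage.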
Update:
I just stumbled upon this question:
robots.txt allow root only, disallow everything else?
In particular, I looked at eywu's answer, and now I think my initial assumption was wrong: I am only allowed to access website.com, but not website.com/other-stuff.
Upvotes: 2
Views: 298
Reputation: 1121634
No, your crawler can only access the homepage.

The Allow directive lets you access /$; the $ is significant here! It means that only the literal / path matches; any other path (like /foo/bar) is not allowed, per the Disallow directive, which matches all paths (it has no $).
See the Google documentation on path matching:

$ designates the end of the URL

Mechanize correctly interpreted the robots.txt file.
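To make that concrete, the two rules can be translated by hand into Python regular expressions. This is a rough sketch of just these two rules ($ becomes the regex end anchor, everything else matches literally from the start of the path); the full matching rules also cover * wildcards and percent-encoding:

```python
import re

# Hand-translated rules: '$' anchors the end of the path,
# all other characters match literally from the start.
allow_rule = re.compile(r"/$")    # Allow: /$  -> only the bare "/"
disallow_rule = re.compile(r"/")  # Disallow: / -> every path

for path in ("/", "/foo/bar"):
    if allow_rule.match(path):
        print(path, "-> allowed (Allow: /$ matches)")
    elif disallow_rule.match(path):
        print(path, "-> blocked (Disallow: / matches)")
```

Only "/" satisfies the allow rule; "/foo/bar" falls through to Disallow and is blocked.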
Upvotes: 0
Reputation: 15953
Your update is correct. You can access http://example.com/, but not http://example.com/page.htm.
This is taken from the Robots.txt Specifications; look at the very bottom of the page, in the section titled "Order of precedence for group-member records", which states:
URL                             allow:     disallow:   Verdict
http://example.com/page        /p         /           allow
http://example.com/folder/page /folder/   /folder     allow
http://example.com/page.htm    /page      /*.htm      undefined
http://example.com/            /$         /           allow
http://example.com/page.htm    /$         /           disallow
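The precedence rule behind that table (the most specific rule, by length of the path entry, wins) can be sketched roughly as follows. The helper names `matches` and `verdict` are mine, and it deliberately simplifies: any conflict involving a * wildcard is declared undefined, which reproduces the table above but is not the complete spec:

```python
import re

def matches(pattern, path):
    """Does a robots.txt path pattern match a URL path?
    '*' matches any run of characters; '$' anchors the end."""
    regex = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in pattern
    )
    return re.match(regex, path) is not None

def verdict(path, allow, disallow):
    a, d = matches(allow, path), matches(disallow, path)
    if a and d:
        # Both rules match. Wildcard conflicts are undefined (simplified);
        # otherwise the longer (more specific) pattern wins, and on a tie
        # the least restrictive rule (allow) is used.
        if "*" in allow or "*" in disallow:
            return "undefined"
        return "allow" if len(allow) >= len(disallow) else "disallow"
    if d:
        return "disallow"
    return "allow"  # Allow matched, or no rule matched at all

# The two rows that answer the question:
print(verdict("/", "/$", "/"))          # the homepage
print(verdict("/page.htm", "/$", "/"))  # any other page
```

For this site's rules, the homepage comes out "allow" and every other path "disallow", matching the last two rows of the table.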
Upvotes: 3