Reputation: 179
I am trying to scrape several websites using the BeautifulSoup and mechanize libraries in Python. However, I came across a website with the following robots.txt:
User-Agent: *
Allow: /$
Disallow: /
According to Wikipedia, an Allow directive counteracts a following Disallow directive. I have read about simpler examples and understand how it works, but this situation is a bit confusing to me. Am I right to assume that my crawler is allowed to access everything on this website? If yes, it seems really strange that the website would even bother writing a robots.txt in the first place...
Extra information:
Mechanize gave me an error when I tried to scrape this website; it was something along the lines of "HTTP Error 403: crawling is prohibited because of robots.txt". If my assumption stated above is correct, then I think mechanize returned an error while trying to access the website because it is either not equipped to handle such a robots.txt or it follows a different standard for interpreting robots.txt files. (In which case I will just have to make my crawler ignore robots.txt.)
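As far as I can tell, Python's standard-library robots.txt parser (which mechanize's handling resembles) does not implement the $ end-of-URL extension and treats the character literally, so under this file it reports every path, including the homepage, as disallowed. A quick sanity check (example.com stands in for the real site):

```python
from urllib.robotparser import RobotFileParser

# Feed the site's robots.txt rules directly into the stdlib parser.
rp = RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Allow: /$",
    "Disallow: /",
])

# The stdlib parser treats '$' as a literal character rather than an
# end anchor, so "Allow: /$" never matches a real path and
# "Disallow: /" ends up blocking everything.
print(rp.can_fetch("*", "http://example.com/"))       # homepage
print(rp.can_fetch("*", "http://example.com/other"))  # any other page
```

Both calls come back False here, which would explain the 403 even on the homepage.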
Update:
I just stumbled upon this question:
robots.txt allow root only, disallow everything else?
In particular, I looked at eywu's answer, and now I think my initial assumption was wrong: I am only allowed to access website.com, but not website.com/other-stuff.
Upvotes: 2
Views: 298
Reputation: 1121634
No, your crawler can only access the homepage.

The Allow directive lets you access /$; the $ is significant here! It means that only the literal / path matches; any other path (like /foo/bar) is not allowed, per the Disallow directive, which matches all paths (it has no $).
See the Google documentation on path matching:

$ designates the end of the URL

Mechanize correctly interpreted the robots.txt file.
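To make that concrete, the two rules can be translated by hand into Python regular expressions. This is a rough sketch of just these two rules ($ becomes the regex end anchor, everything else matches literally from the start of the path); the full matching rules also cover * wildcards and percent-encoding:

```python
import re

# Hand-translated rules: '$' anchors the end of the path,
# all other characters match literally from the start.
allow_rule = re.compile(r"/$")    # Allow: /$  -> only the bare "/"
disallow_rule = re.compile(r"/")  # Disallow: / -> every path

for path in ("/", "/foo/bar"):
    if allow_rule.match(path):
        print(path, "-> allowed (Allow: /$ matches)")
    elif disallow_rule.match(path):
        print(path, "-> blocked (Disallow: / matches)")
```

Only "/" satisfies the allow rule; "/foo/bar" falls through to Disallow and is blocked.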
Upvotes: 0
Reputation: 15953
Your update is correct. You can access http://example.com/, but not http://example.com/page.htm.
This is taken from the Robots.txt Specifications; look at the very bottom of the page, in the section titled "Order of precedence for group-member records", which states:
URL                             allow:     disallow:   Verdict
http://example.com/page        /p         /           allow
http://example.com/folder/page /folder/   /folder     allow
http://example.com/page.htm    /page      /*.htm      undefined
http://example.com/            /$         /           allow
http://example.com/page.htm    /$         /           disallow
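The precedence rule behind that table (the most specific rule, by length of the path entry, wins) can be sketched roughly as follows. The helper names `matches` and `verdict` are mine, and it deliberately simplifies: any conflict involving a * wildcard is declared undefined, which reproduces the table above but is not the complete spec:

```python
import re

def matches(pattern, path):
    """Does a robots.txt path pattern match a URL path?
    '*' matches any run of characters; '$' anchors the end."""
    regex = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in pattern
    )
    return re.match(regex, path) is not None

def verdict(path, allow, disallow):
    a, d = matches(allow, path), matches(disallow, path)
    if a and d:
        # Both rules match. Wildcard conflicts are undefined (simplified);
        # otherwise the longer (more specific) pattern wins, and on a tie
        # the least restrictive rule (allow) is used.
        if "*" in allow or "*" in disallow:
            return "undefined"
        return "allow" if len(allow) >= len(disallow) else "disallow"
    if d:
        return "disallow"
    return "allow"  # Allow matched, or no rule matched at all

# The two rows that answer the question:
print(verdict("/", "/$", "/"))          # the homepage
print(verdict("/page.htm", "/$", "/"))  # any other page
```

For this site's rules, the homepage comes out "allow" and every other path "disallow", matching the last two rows of the table.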
Upvotes: 3