Reputation: 968
I am not understanding how to use the parse function in the robotparser module. Here is what I tried:
In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")
In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")
In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True
It seems that rp.entries is []. I am not understanding what is wrong. I have tried a simpler example but got the same problem.
Upvotes: 0
Views: 535
Reputation: 968
Well, I just found the answer.
1. The thing was that this robots.txt [from wordpress.com] contained multiple User-agent: * declarations. This is not supported by the robotparser module. A tiny hack of removing the excessive User-agent: *
lines solved the problem.
2. The argument to parse is a list of lines, as was pointed out by Andrew.
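For reference, here is roughly what the fixed version looks like (this is the Python 2 robotparser; in Python 3 the same class lives in urllib.robotparser, and the URL and rules are just the ones from my question):

import robotparser  # Python 2; in Python 3: from urllib import robotparser

# robots.txt with the extra "User-agent: *" lines stripped out,
# so every Disallow rule ends up in the single "*" entry
robots_txt = """User-agent: *
Disallow: /next/
Disallow: /activate/
Disallow: /signup/
Disallow: /related-tags.php
Disallow: /cgi-bin/
"""

rp = robotparser.RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")
rp.parse(robots_txt.split("\n"))  # parse() wants a list of lines

print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))  # False now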
Upvotes: 0
Reputation: 10846
There are two issues here. Firstly, the rp.parse method takes a list of strings, so you should add .split("\n") to that line.
The second issue is that rules for the * user agent are stored in rp.default_entry rather than rp.entries. If you check that you'll see it contains an Entry object.
I'm not sure who is at fault here, but the Python implementation of the parser only respects the first User-agent: * section, so in the example you've given only /next/ is disallowed. The other disallow lines are ignored. I haven't read the spec so I can't say if this is a malformed robots.txt file or if the Python code is wrong. I would assume the former though.
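Something like this shows it (a cut-down version of the file; the URL is just the one from your question):

import robotparser  # Python 2; in Python 3: from urllib import robotparser

# Two "*" sections, as in the file above - only the first one is kept
lines = """User-agent: *
Disallow: /next/

User-agent: *
Disallow: /signup/
""".split("\n")

rp = robotparser.RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")
rp.parse(lines)

print(rp.entries)        # []
print(rp.default_entry)  # the Entry holding the first "*" section's rules

print(rp.can_fetch("*", "http://anilattech.wordpress.com/next/page"))  # False
print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))    # True - second "*" section ignored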
Upvotes: 1