Anil Shanbhag

Reputation: 968

Using python robotparser

I don't understand how to use the parse function in the robotparser module. Here is what I tried:

In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")

In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")

In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True

It seems that rp.entries is [], and I don't understand what is wrong. I have tried a simpler example, but I get the same problem.

Upvotes: 0

Views: 535

Answers (2)

Anil Shanbhag

Reputation: 968

Well, I just found the answer.

1. This robots.txt (from wordpress.com) contains multiple User-agent declarations, which the robotparser module does not support. A tiny hack of removing the excess User-agent: * lines solved the problem.

2. The argument to parse is a list, as Andrew pointed out.
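
A minimal sketch of that hack, assuming Python 3 (where the module lives at urllib.robotparser) and assuming the repeated User-agent: * groups sit next to each other, as they do in this file. The helper name merge_wildcard_groups is hypothetical and not part of the standard library:

```python
from urllib.robotparser import RobotFileParser


def merge_wildcard_groups(text):
    """Hypothetical helper: keep the first 'User-agent: *' line, drop the rest.

    Returns a list of lines, which is what RobotFileParser.parse() expects.
    """
    kept = []
    seen_wildcard = False
    for line in text.splitlines():
        # Ignore trailing comments and spacing when recognising the line.
        bare = line.split("#", 1)[0].replace(" ", "").lower()
        if bare == "user-agent:*":
            if seen_wildcard:
                continue  # skip repeated wildcard declarations
            seen_wildcard = True
        kept.append(line)
    return kept


# Shortened version of the robots.txt from the question.
ROBOTS_TXT = """\
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(merge_wildcard_groups(ROBOTS_TXT))

signup_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/signup/")
about_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/about/")
print(signup_allowed, about_allowed)
```

With the duplicate declarations removed, all the Disallow lines end up in a single * group, so /signup/ should be reported as blocked while unrelated paths stay allowed.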

Upvotes: 0

Andrew Wilkinson

Reputation: 10846

There are two issues here. First, the rp.parse method takes a list of strings, so you should add .split("\n") to the end of the string you pass to rp.parse.
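
To illustrate the first point, here is a small sketch, assuming Python 3 (where robotparser has moved to urllib.robotparser). The two-line robots.txt is a made-up reduction of the one in the question:

```python
from urllib.robotparser import RobotFileParser

# Made-up, reduced robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /signup/
"""
URL = "http://anilattech.wordpress.com/signup/"

# Passing the raw string: parse() iterates over it one character at a time,
# so no rule is ever recognised and everything looks allowed.
wrong = RobotFileParser()
wrong.parse(ROBOTS_TXT)
wrong_result = wrong.can_fetch("*", URL)

# Passing a list of lines: the Disallow rule is picked up.
right = RobotFileParser()
right.parse(ROBOTS_TXT.splitlines())
right_result = right.can_fetch("*", URL)

print(wrong_result, right_result)
```

str.splitlines() is used here instead of .split("\n") only because it also copes with \r\n line endings; either works for this input.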

The second issue is that rules for the * user agent are stored in rp.default_entry rather than rp.entries. If you check that you'll see it contains an Entry object.

I'm not sure who is at fault here, but the Python implementation of the parser only respects the first User-agent: * section, so in the example you've given only /next/ is disallowed and the other Disallow lines are ignored. I haven't read the spec, so I can't say whether this is a malformed robots.txt file or whether the Python code is wrong. I would assume the former, though.
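
The behaviour described above can be checked with a short sketch, again assuming Python 3's urllib.robotparser. Note that entries and default_entry are undocumented attributes of the parser, so inspecting them is for debugging only and may not hold across versions:

```python
from urllib.robotparser import RobotFileParser

# Two separate wildcard sections, as in the question's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Only the first "User-agent: *" section is kept as the default entry.
next_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/next/")
signup_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/signup/")
print(next_allowed, signup_allowed)

# Undocumented internals: wildcard rules live in default_entry, not entries.
print(rp.entries)
print(rp.default_entry)
```

Here rp.entries stays empty because the file names no specific user agent, which matches what the question observed even once the input is passed as a list.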

Upvotes: 1
