user1544207

Reputation: 60

scrapy fails to crawl craigslist

This same code crawls yellowbook with no issues and works as expected. Change the rule over to craigslist and it hits the first URL, then peters out with no relevant output.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigs.items import CraigsItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craiglist.org"]

    start_urls = ["http://newyork.craigslist.org/cpg/"]

    # follow every posting link in the listing and hand it to parse_profile
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)),
                  follow=True, callback='parse_profile')]

    def parse_profile(self, response):
        found = []
        img = CraigsItem()
        hxs = HtmlXPathSelector(response)
        img['title'] = hxs.select('//h2[contains(@class, "postingtitle")]/text()').extract()
        img['text'] = hxs.select('//section[contains(@id, "postingbody")]/text()').extract()
        img['tags'] = hxs.select('//html/body/article/section/section[2]/section[2]/ul/li[1]').extract()
        found.append(img)  # without this append, found[0] raises IndexError

        print found[0]
        return found[0]

Here is the output: http://pastie.org/6087878 As you can see, it has no issue fetching the first URL to crawl, http://newyork.craigslist.org/mnh/cpg/3600242403.html, but then it dies.

From the Scrapy shell I can dump all the links just fine, either by XPath with SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)).extract_links(response) or by pattern with SgmlLinkExtractor(allow=r'/cpg/.+').extract_links(response).
Output -> http://pastie.org/6085322
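
(For reference, a minimal shell session of this kind, assuming the same listing URL, looks roughly like this:)

$ scrapy shell http://newyork.craigslist.org/cpg/
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> # both extractors return the posting links when run interactively
>>> SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)).extract_links(response)
>>> SgmlLinkExtractor(allow=r'/cpg/.+').extract_links(response)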

but in the crawl, the same query fails. WTF??

Upvotes: 1

Views: 1255

Answers (1)

akhter wahab

Reputation: 4085

If you look in the documentation you will see:

allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.

Your allowed domain is

 allowed_domains = ["craiglist.org"]

but you are trying to fetch from a subdomain that does not match it:

02-07 15:39:03+0000 [craigs] DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET http://newyork.craigslist.org/mnh/cpg/3600242403.html>

That is why the request is filtered as offsite.
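
Under the hood, OffsiteMiddleware derives a hostname check from allowed_domains. A simplified sketch of that matching logic (not Scrapy's actual source; written in Python 2 to match the era of this question) shows why the request above is rejected:

import re
from urlparse import urlparse  # Python 2 stdlib

def url_is_allowed(url, allowed_domains):
    # A host is on-site if it equals an allowed domain or is a subdomain of one,
    # e.g. 'newyork.craigslist.org' matches 'craigslist.org' but not 'craiglist.org'.
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    host = urlparse(url).hostname or ''
    return bool(re.match(pattern, host))

print url_is_allowed('http://newyork.craigslist.org/mnh/cpg/3600242403.html',
                     ['craiglist.org'])   # False -- the 's' is missing
print url_is_allowed('http://newyork.craigslist.org/mnh/cpg/3600242403.html',
                     ['craigslist.org'])  # True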

Either remove allowed_domains from your crawler or add the proper domains to it (note that "craiglist.org" is also missing an 's') to avoid filtered offsite requests.
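
For example, a minimal fix that keeps the rest of the spider unchanged:

class MySpider(CrawlSpider):
    name = "craigs"
    # spelled correctly; subdomains such as newyork.craigslist.org
    # now count as on-site and are no longer filtered
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://newyork.craigslist.org/cpg/"]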

Upvotes: 3
