user1544207

Reputation: 60

scrapy fails to crawl craigslist

This same code crawls yellowbook with no issues and works as expected. Change the rule over to craigslist and it hits the first URL, then peters out with no relevant output.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigs.items import CraigsItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craiglist.org"]

    start_urls = ["http://newyork.craigslist.org/cpg/"]

    # follow every posting link in the listing and hand it to parse_profile
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)),
                  follow=True, callback='parse_profile')]

    def parse_profile(self, response):
        found = []
        img = CraigsItem()
        hxs = HtmlXPathSelector(response)
        img['title'] = hxs.select('//h2[contains(@class, "postingtitle")]/text()').extract()
        img['text'] = hxs.select('//section[contains(@id, "postingbody")]/text()').extract()
        img['tags'] = hxs.select('//html/body/article/section/section[2]/section[2]/ul/li[1]').extract()
        found.append(img)  # without this append, found[0] raises IndexError

        print found[0]
        return found[0]

Here is the output: http://pastie.org/6087878 As you can see, it has no issue fetching the first URL to crawl, http://newyork.craigslist.org/mnh/cpg/3600242403.html, but then it dies.

From the Scrapy shell I can dump all the links just fine, either by XPath with SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)).extract_links(response) or by pattern with SgmlLinkExtractor(allow=r'/cpg/.+').extract_links(response).
Output -> http://pastie.org/6085322
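
(For reference, a minimal shell session of this kind, assuming the same listing URL, looks roughly like this:)

$ scrapy shell http://newyork.craigslist.org/cpg/
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> # both extractors return the posting links when run interactively
>>> SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)).extract_links(response)
>>> SgmlLinkExtractor(allow=r'/cpg/.+').extract_links(response)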

but in the crawl, the same query fails. WTF??

Upvotes: 1

Views: 1255

Answers (1)

akhter wahab

Reputation: 4085

If you look in the documentation you will see:

allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.

Your allowed domain is

 allowed_domains = ["craiglist.org"]

but you are trying to fetch from a subdomain that does not match it:

02-07 15:39:03+0000 [craigs] DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET http://newyork.craigslist.org/mnh/cpg/3600242403.html>

That is why the request is filtered as offsite.
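
Under the hood, OffsiteMiddleware derives a hostname check from allowed_domains. A simplified sketch of that matching logic (not Scrapy's actual source; written in Python 2 to match the era of this question) shows why the request above is rejected:

import re
from urlparse import urlparse  # Python 2 stdlib

def url_is_allowed(url, allowed_domains):
    # A host is on-site if it equals an allowed domain or is a subdomain of one,
    # e.g. 'newyork.craigslist.org' matches 'craigslist.org' but not 'craiglist.org'.
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    host = urlparse(url).hostname or ''
    return bool(re.match(pattern, host))

print url_is_allowed('http://newyork.craigslist.org/mnh/cpg/3600242403.html',
                     ['craiglist.org'])   # False -- the 's' is missing
print url_is_allowed('http://newyork.craigslist.org/mnh/cpg/3600242403.html',
                     ['craigslist.org'])  # True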

Either remove allowed_domains from your crawler or add the proper domains to it (note that "craiglist.org" is also missing an 's') to avoid filtered offsite requests.
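
For example, a minimal fix that keeps the rest of the spider unchanged:

class MySpider(CrawlSpider):
    name = "craigs"
    # spelled correctly; subdomains such as newyork.craigslist.org
    # now count as on-site and are no longer filtered
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://newyork.craigslist.org/cpg/"]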

Upvotes: 3
