Reputation: 60
This same code crawls Yellowbook with no issues, exactly as expected. Change the rule over to Craigslist, however, and it hits the first URL and then peters out with no relevant output.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigs.items import CraigsItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craiglist.org"]
    start_urls = ["http://newyork.craigslist.org/cpg/"]
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)),
                  follow=True, callback='parse_profile')]

    def parse_profile(self, response):
        img = CraigsItem()
        hxs = HtmlXPathSelector(response)
        img['title'] = hxs.select('//h2[contains(@class, "postingtitle")]/text()').extract()
        img['text'] = hxs.select('//section[contains(@id, "postingbody")]/text()').extract()
        img['tags'] = hxs.select('//html/body/article/section/section[2]/section[2]/ul/li[1]').extract()
        print img
        return img
Here is the output: http://pastie.org/6087878. As you can see, it has no issue fetching the first URL to crawl, http://newyork.craigslist.org/mnh/cpg/3600242403.html, but then dies.
From the CLI I can dump all the links, either with XPaths, SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)).extract_links(response), or with the allow keyword, SgmlLinkExtractor(allow=r'/cpg/.+').extract_links(response).
Output: http://pastie.org/6085322
But in the crawl, the same query fails. WTF??
Upvotes: 1
Views: 1255
Reputation: 4085
If you look in the documentation, you will see:
allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled.
Your allowed domain is
allowed_domains = ["craiglist.org"]
but you are trying to fetch a subdomain, and note that "craiglist.org" is also missing the "s" in "craigslist", so the hostname newyork.craigslist.org does not match it at all:
02-07 15:39:03+0000 [craigs] DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET http://newyork.craigslist.org/mnh/cpg/3600242403.html>
That is why the request is filtered. Either remove allowed_domains from your crawler, or add the proper domain (e.g. "craigslist.org") to it to avoid filtered offsite requests.
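You can reproduce the filtering with a simplified sketch of the hostname test that OffsiteMiddleware applies (this is an illustration of the matching rule, not the middleware's actual code): a request host passes only if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Simplified sketch of the check Scrapy's OffsiteMiddleware performs:
    the request host must equal an allowed domain or be a subdomain of one."""
    host = urlparse(url).hostname or ""
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )

url = "http://newyork.craigslist.org/mnh/cpg/3600242403.html"

# "craiglist.org" (missing the "s") never matches, so the request is filtered:
print(is_offsite(url, ["craiglist.org"]))   # True  -> filtered offsite
# with the correctly spelled domain, the subdomain matches and is crawled:
print(is_offsite(url, ["craigslist.org"]))  # False -> followed
```

This is why fixing allowed_domains (or removing it entirely) makes the crawl proceed past the first page.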
Upvotes: 3