For some reason my crawler is only crawling a couple of pages. It's supposed to at least follow all the URLs on the start page. Also, this is on Craigslist; I'm not sure if they're known for blocking crawlers. Any idea what's going on?
Here's the output:
2012-07-01 15:02:56-0400 [craigslist] INFO: Spider opened
2012-07-01 15:02:56-0400 [craigslist] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-01 15:02:56-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6026
2012-07-01 15:02:56-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6083
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Crawled (200) <GET http://boston.craigslist.org/search/fua?query=chest+of+drawers> (referer: None)
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Crawled (200) <GET http://boston.craigslist.org/fua/> (referer: None)
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Filtered offsite request to 'boston.craigslist.org': <GET http://boston.craigslist.org/sob/fud/3112540401.html>
2012-07-01 15:02:57-0400 [craigslist] INFO: Closing spider (finished)
And here's the code:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from craigslist.items import CraigslistItem
from scrapy.http import Request

class BostonCragistlistSpider(CrawlSpider):
    name = 'craigslist'
    allowed_domains = ['http://boston.craigslist.org']
    start_urls = ['http://boston.craigslist.org/search/fua?query=chest+of+drawers']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'\/[a-z]{3}\/[a-z]{3}\/.*\.html'), callback='get_image', follow=True),
        # looking for listing URLs like:
        # http://boston.craigslist.org/sob/fud/3111565340.html
        # http://boston.craigslist.org/gbs/fuo/3112103005.html
        Rule(SgmlLinkExtractor(allow=r'\/search\/fua\?query=\.*'), callback='extract_links', follow=True),
    )

    def extract_links(self, response):
        print 'extracting links'
        links = hxs.select('//p[@class="row"]//a/@href').extract()
        for link in links:
            return Request(link, callback=self.get_image)

    def get_image(self, response):
        print 'parsing'
        hxs = HtmlXPathSelector(response)
        images = hxs.select('//img//@src').extract()
Any thoughts would be greatly appreciated!
Upvotes: 0
Views: 1464
Reputation: 2254
allowed_domains needs to contain domain names, not URLs. Change it to:
allowed_domains = ['boston.craigslist.org']
You can see from your logs that the requests were being filtered by the offsite middleware (the component that drops requests for URLs outside allowed_domains):
2012-07-01 15:02:57-0400 [craigslist] DEBUG: Filtered offsite request to 'boston.craigslist.org': <GET http://boston.craigslist.org/sob/fud/3112540401.html>
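For reference, a corrected version of the whole spider might look like the sketch below. It's a minimal sketch, not tested against Craigslist's current markup. Besides the allowed_domains fix, it also addresses two bugs in your code that would surface once requests stop being filtered: extract_links uses hxs without ever building it from the response, and return inside the for loop would issue only the first request, where yield handles them all. (Your second rule's regex also looks like it meant .* rather than \.*, which matches literal dots.)

import urlparse

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class BostonCragistlistSpider(CrawlSpider):
    name = 'craigslist'
    # domain name only, no scheme, so the offsite middleware
    # recognizes boston.craigslist.org links as on-site
    allowed_domains = ['boston.craigslist.org']
    start_urls = ['http://boston.craigslist.org/search/fua?query=chest+of+drawers']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'\/[a-z]{3}\/[a-z]{3}\/.*\.html'),
             callback='get_image', follow=True),
        Rule(SgmlLinkExtractor(allow=r'\/search\/fua\?query=.*'),
             callback='extract_links', follow=True),
    )

    def extract_links(self, response):
        hxs = HtmlXPathSelector(response)  # was missing in the original
        links = hxs.select('//p[@class="row"]//a/@href').extract()
        for link in links:
            # hrefs may be relative, so join them against the page URL;
            # yield (not return) so every link gets requested
            url = urlparse.urljoin(response.url, link)
            yield Request(url, callback=self.get_image)

    def get_image(self, response):
        hxs = HtmlXPathSelector(response)
        images = hxs.select('//img/@src').extract()
        # build and return your CraigslistItem here

With allowed_domains fixed, the "Filtered offsite request" messages should disappear and the listing pages should actually be crawled and passed to get_image.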
Upvotes: 2