Argali

Reputation: 21

CrawlSpider only crawls start_urls

All, I'm trying to build a spider that crawls the sherdog website and gives me the profile of every fighter (name, birthdate, height, nationality). When I run the spider it parses the start_urls fine, but the CrawlSpider never starts crawling beyond them, so I end up with only 2 parsed items. I've read the docs, but I'm also new to Scrapy, so I might be missing something. Do you have any idea what? The website uses relative URLs, so at first I thought that might be the issue, but even after building the absolute URLs it still didn't work. I really hope you guys can help me out!

class ProfileSpyder(CrawlSpider):
  name = "Profile"
  allowed_domains = ["http://www.sherdog.com/fighter/"]
  start_urls = ["http://www.sherdog.com/fighter/Daniel-Cormier-52311", 
            "http://www.sherdog.com/fighter/Ronda-Rousey-73073"]

  Rules = (
      Rule(LinkExtractor(allow=('/fighter/')), callback='parse_item', follow=True)      
      )

  def parse_item(self, response):
    #Build absolute urls and send new requests
    for href in response.xpath("/html/body/div[3]/div[2]/div[1]/section[2]/div"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_item)       
    #Parse item                     
    item = FighterProfile()
    item['Name'] = response.xpath('.//section/div/h1/span[@class="fn"]/text()').extract()
    item['Birthdate'] = response.xpath('.//section/div/div/div/div/div/div/span/span[@itemprop="birthDate"]/text()').extract()
    item['Height'] = response.xpath('.//section/div/div/div/div/div/div/span[@class="item height"]/strong/text()').extract()
    item['Nationality'] = response.xpath('.//section[1]/div/div[1]/div[1]/div/div[1]/div[1]/span[2]/strong/text()').extract()
    yield item 

And the log:

2015-12-07 18:15:11 [scrapy] INFO: Scrapy 1.0.3 started (bot: ufcfights)
2015-12-07 18:15:11 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-12-07 18:15:11 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ufcfights.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ufcfights.spiders'], 'BOT_NAME': 'ufcfights', 'USER_AGENT': 'Chrome/46.0.2490.80', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 1.0}
2015-12-07 18:15:12 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-07 18:15:12 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-07 18:15:12 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-07 18:15:12 [scrapy] INFO: Enabled item pipelines:
2015-12-07 18:15:12 [scrapy] INFO: Spider opened
2015-12-07 18:15:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-07 18:15:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-07 18:15:12 [scrapy] DEBUG: Crawled (200) <GET http://www.sherdog.com/fighter/Daniel-Cormier-52311> (referer: None)
2015-12-07 18:15:13 [scrapy] DEBUG: Crawled (200) <GET http://www.sherdog.com/fighter/Ronda-Rousey-73073> (referer: None)
2015-12-07 18:15:14 [scrapy] INFO: Closing spider (finished)
2015-12-07 18:15:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 452,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 46874,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 7, 17, 15, 14, 92000),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 12, 7, 17, 15, 12, 618000)}
2015-12-07 18:15:14 [scrapy] INFO: Spider closed (finished)

Upvotes: 0

Views: 658

Answers (1)

eLRuLL

Reputation: 18799

You can't override the parse method when using a CrawlSpider; it uses parse internally to drive the rules. Check the warning here

Just change the callback method on the rules.

Upvotes: 1

Related Questions