Reputation: 2834
I have a crawler that starts from a sitemap, grabs a couple hundred unique URLs, and then does further processing on those pages. However, I only get callbacks for the first 10 URLs; the spider log shows HTTP GETs for only those first 10.
import scrapy

class MySpider(scrapy.spider.BaseSpider):
    # settings (name, start_urls, ...) ...

    def parse(self, response):
        # collect the unique URLs found in the sitemap
        urls = [...]
        for url in urls:
            # schedule a follow-up request for each URL
            request = scrapy.http.Request(url, callback=self.parse_part2)
            print url
            yield request

    def parse_part2(self, response):
        print response.url
        # do more parsing here
I've considered a few options; is there some mysterious max_branching_factor flag I am not aware of, or something similar?
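For what it's worth, Scrapy has no max_branching_factor setting, but it does ship documented settings that can stop a crawl early. A minimal checklist sketch, assuming a standard settings.py (the settings named below are real Scrapy settings; the values shown are their defaults, which impose no limit):

# settings.py -- settings worth ruling out before suspecting anything exotic
DEPTH_LIMIT = 0             # 0 = crawl to any depth (default)
CLOSESPIDER_PAGECOUNT = 0   # N > 0 closes the spider after N responses
CLOSESPIDER_ITEMCOUNT = 0   # N > 0 closes the spider after N scraped items
CLOSESPIDER_TIMEOUT = 0     # N > 0 closes the spider after N seconds

Here is the log from the run: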
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url1>
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url2>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url3>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url4>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url5>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url6>
yay callback!
yay callback!
yay callback!
yay callback!
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url7>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url8>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url9>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url10>
yay callback!
2015-02-11 02:05:13-0800 [mysite] INFO: Closing spider (finished)
2015-02-11 02:05:13-0800 [mysite] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4590,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 638496,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 11, 10, 5, 13, 260322),
'log_count/DEBUG': 17,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 11,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'start_time': datetime.datetime(2015, 2, 11, 10, 5, 12, 492811)}
2015-02-11 02:05:13-0800 [mysite] INFO: Spider closed (finished)
Upvotes: 1
Views: 257
Reputation: 2834
So I found this attribute in one of my settings files:
max_requests / MAX_REQUESTS = 10
which was responsible for the spider quitting early (oops).
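To spell out the fix: max_requests/MAX_REQUESTS is a setting specific to my project, not a Scrapy built-in, so the sketch below assumes my setup and the semantics I observed (the spider closing once the cap is hit):

# settings.py -- project-specific cap, not a Scrapy built-in
# MAX_REQUESTS = 10    # the offending line: the spider closed after 10 requests
MAX_REQUESTS = 1000    # raised well above the couple hundred sitemap URLs

If you actually want a request cap, Scrapy's built-in CLOSESPIDER_PAGECOUNT setting is the documented way to get one.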
Upvotes: 1
Reputation: 2223
Try setting LOG_LEVEL to DEBUG; you will see more log output. And if you do, please paste the logs here.
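LOG_LEVEL is a standard Scrapy setting; a quick sketch of the two usual ways to set it (the spider name mysite is a placeholder):

# settings.py
LOG_LEVEL = 'DEBUG'

# or override it for a single run without touching settings:
#   scrapy crawl mysite -s LOG_LEVEL=DEBUG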
Upvotes: 0