Reputation: 2834
I have a crawler that starts from a sitemap, grabs a couple hundred unique URLs, and then does further processing on those pages. However, I only get callbacks for the first 10 URLs; the spider log shows HTTP GETs for only those first 10.
import scrapy

class MySpider(scrapy.spider.BaseSpider):
    # settings (name, start_urls, ...) ...

    def parse(self, response):
        # collect the unique URLs found in the sitemap
        urls = [...]
        for url in urls:
            # schedule a follow-up request for each URL
            request = scrapy.http.Request(url, callback=self.parse_part2)
            print url
            yield request

    def parse_part2(self, response):
        print response.url
        # do more parsing here
I've considered a few options; is there some mysterious max_branching_factor flag I am not aware of, or something similar?
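For what it's worth, Scrapy has no max_branching_factor setting, but it does ship documented settings that can stop a crawl early. A minimal checklist sketch, assuming a standard settings.py (the settings named below are real Scrapy settings; the values shown are their defaults, which impose no limit):

# settings.py -- settings worth ruling out before suspecting anything exotic
DEPTH_LIMIT = 0             # 0 = crawl to any depth (default)
CLOSESPIDER_PAGECOUNT = 0   # N > 0 closes the spider after N responses
CLOSESPIDER_ITEMCOUNT = 0   # N > 0 closes the spider after N scraped items
CLOSESPIDER_TIMEOUT = 0     # N > 0 closes the spider after N seconds

Here is the log from the run: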
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url1>
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url2>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url3>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url4>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url5>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url6>
yay callback!
yay callback!
yay callback!
yay callback!
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url7>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url8>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url9>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url10>
yay callback!
2015-02-11 02:05:13-0800 [mysite] INFO: Closing spider (finished)
2015-02-11 02:05:13-0800 [mysite] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4590,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 638496,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 11, 10, 5, 13, 260322),
'log_count/DEBUG': 17,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 11,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'start_time': datetime.datetime(2015, 2, 11, 10, 5, 12, 492811)}
2015-02-11 02:05:13-0800 [mysite] INFO: Spider closed (finished)
Upvotes: 1
Views: 257
Reputation: 2834
So I found this attribute in one of my settings files:
max_requests / MAX_REQUESTS = 10
which was responsible for the spider quitting early (oops).
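To spell out the fix: max_requests/MAX_REQUESTS is a setting specific to my project, not a Scrapy built-in, so the sketch below assumes my setup and the semantics I observed (the spider closing once the cap is hit):

# settings.py -- project-specific cap, not a Scrapy built-in
# MAX_REQUESTS = 10    # the offending line: the spider closed after 10 requests
MAX_REQUESTS = 1000    # raised well above the couple hundred sitemap URLs

If you actually want a request cap, Scrapy's built-in CLOSESPIDER_PAGECOUNT setting is the documented way to get one.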
Upvotes: 1
Reputation: 2223
Try setting LOG_LEVEL to DEBUG; you will see more log output. And if you do, please paste the logs here.
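LOG_LEVEL is a standard Scrapy setting; a quick sketch of the two usual ways to set it (the spider name mysite is a placeholder):

# settings.py
LOG_LEVEL = 'DEBUG'

# or override it for a single run without touching settings:
#   scrapy crawl mysite -s LOG_LEVEL=DEBUG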
Upvotes: 0