Reputation: 31
Similar to what is done in this question: How can i use multiple requests and pass items in between them in scrapy python
I am trying to chain requests from spiders, as in Dave McLain's answer there. Returning a request object from the parse function works fine, allowing the spider to continue with the next request.
def parse(self, response):
    # Some operations
    self.url_index += 1
    if self.url_index < len(self.urls):
        # Chain the next request so the spider keeps crawling the URL list.
        return scrapy.Request(url=self.urls[self.url_index], callback=self.parse)
    return items
However, I have the default Spider Middleware, where I do some caching and logging operations in process_spider_output. A request object returned from the parse function first passes through the middleware, so the middleware has to return the request object as well.
def process_spider_output(self, response, result, spider):
    # Called with the results returned from the Spider, after
    # it has processed the response.
    # Must return an iterable of Request, or item objects.
    if hasattr(spider, 'multiple_urls'):
        if spider.url_index + 1 < len(spider.urls):
            return [result]
            # return [scrapy.Request(url=spider.urls[spider.url_index], callback=spider.parse)]
    # Some operations ...
According to the documentation, it must return an iterable of Request or item objects. However, when I return the result (which is a Request object), or construct a new request object (as in the comment), the spider just terminates (giving the spider finished signal) without making a new request.
Documentation link: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#writing-your-own-spider-middleware
I am not sure whether there is an issue with the documentation or with the way I interpret it, but returning request objects from the middleware doesn't make a new request; instead, it terminates the flow.
Upvotes: 0
Views: 605
Reputation: 31
It was quite simple yet frustrating to solve the problem. The middleware is supposed to return an iterable of request objects. However, putting the request object into a list (which is an iterable) doesn't seem to work. Using yield result in the process_spider_output middleware function instead works.
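For reference, here is a minimal sketch of the fixed middleware. It follows the generator pattern from Scrapy's stock middleware template (iterate over result and re-yield each entry); the class name and the placeholder comments are mine, not from the original code:

class MultipleUrlsSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # `result` is an iterable of the Request/item objects produced by
        # the spider callback. Re-yielding each entry turns this function
        # into a generator, so chained requests reach the engine one by one
        # instead of being wrapped inside another list.
        for entry in result:
            # Placeholder for the caching and logging operations.
            yield entry

This likely also explains the failure above: returning [result] hands the engine a list whose single element is an iterable rather than a Request, so the chained request never gets scheduled.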
Since the main issue is resolved, I'll leave this answer as a reference. Better explanations of why this is the case are appreciated.
Upvotes: 1