Reputation: 8958
I'm trying to iterate through a list of URLs returned from the callback passed to a scrapy Request, but I'm getting the following error:
TypeError: 'Request' object is not iterable
The following works. I can see all the extracted URLs flood the terminal:
import scrapy

class PLSpider(scrapy.Spider):
    name = 'pl'
    start_urls = [ 'https://example.com' ]

    def genres(self, resp):
        for genre in resp.css('div.sub-menus a'):
            yield {
                'genre': genre.css('::text').extract_first(),
                'url': genre.css('::attr(href)').extract_first()
            }

    def extractSamplePackURLs(self, resp):
        return {
            'packs': resp.css('h4.product-title a::attr(href)').extract()
        }

    def extractPackData(self, resp):
        return {
            'title': resp.css('h1.product-title::text'),
            'description': resp.css('div.single-product-description p').extract_first()
        }

    def parse(self, resp):
        for genre in self.genres(resp):
            samplePacks = scrapy.Request(genre['url'], callback=self.extractSamplePackURLs)
            yield samplePacks
But if I replace the yield samplePacks line with:
    def parse(self, resp):
        for genre in self.genres(resp):
            samplePacks = scrapy.Request(genre['url'], callback=self.extractSamplePackURLs)
            for pack in samplePacks:
                yield pack
... I get the error I posted above.
Why is this and how can I loop through the returned value of the callback?
Upvotes: 1
Views: 3072
Reputation: 20748
Yielding Request objects in scrapy.Spider callbacks only tells the Scrapy framework to enqueue HTTP requests. It yields HTTP request objects, just that; it does not download them immediately, nor does it give back control once they are downloaded, i.e. after the yield you still don't have the result. Request objects are not promises, futures, or Deferreds. Scrapy is not designed the same way as the various async frameworks.
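That mismatch is exactly what the traceback in the question shows: scrapy.Request(...) merely constructs a Request object, and iterating over one fails. A trivial standalone illustration:

    import scrapy

    req = scrapy.Request('https://example.com')
    print(type(req))  # <class 'scrapy.http.request.Request'>
    for pack in req:  # TypeError: 'Request' object is not iterable
        pass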
These Request objects will eventually get processed by the framework's downloader, and the response body from each HTTP request will be passed to the associated callback.
This is the basis of Scrapy's asynchronous programming pattern.
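In other words, the way to "loop through the returned value of the callback" is to move that loop into the callback itself and chain requests from there. Here is a minimal sketch along those lines, reusing the (hypothetical) selectors from the question's spider; extractSamplePackURLs now yields follow-up requests instead of returning a dict:

    import scrapy

    class PLSpider(scrapy.Spider):
        name = 'pl'
        start_urls = [ 'https://example.com' ]

        def parse(self, resp):
            # Yield one Request per genre link; Scrapy downloads them
            # later and calls extractSamplePackURLs with each response.
            for genre in resp.css('div.sub-menus a::attr(href)').extract():
                yield scrapy.Request(resp.urljoin(genre), callback=self.extractSamplePackURLs)

        def extractSamplePackURLs(self, resp):
            # The loop over the callback's "results" lives here, inside
            # the callback itself, chaining one more level of requests.
            for pack in resp.css('h4.product-title a::attr(href)').extract():
                yield scrapy.Request(resp.urljoin(pack), callback=self.extractPackData)

        def extractPackData(self, resp):
            yield {
                'title': resp.css('h1.product-title::text').extract_first(),
                'description': resp.css('div.single-product-description p').extract_first()
            }

Each yielded Request goes back onto Scrapy's scheduler queue, so the spider never blocks waiting for a download.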
If you want to do something more "procedural-like", in which yield request(...) gets you the HTTP response the next time you have control, you can have a look at https://github.com/rmax/scrapy-inline-requests/.
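For the record, a rough sketch of how that looks, assuming the @inline_requests decorator behaves as documented in that repository's README (a decorated callback receives each downloaded Response right at the yield):

    from inline_requests import inline_requests
    import scrapy

    class PLSpider(scrapy.Spider):
        name = 'pl'
        start_urls = [ 'https://example.com' ]

        @inline_requests
        def parse(self, resp):
            for genre in resp.css('div.sub-menus a::attr(href)').extract():
                # Here yield hands back the downloaded Response,
                # so the whole loop stays in one method.
                genre_resp = yield scrapy.Request(resp.urljoin(genre))
                for pack in genre_resp.css('h4.product-title a::attr(href)').extract():
                    yield {'pack': genre_resp.urljoin(pack)}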
Upvotes: 3