Reputation: 1982
I am trying to request multiple pages and store a returned variable from the callback into a list that will be used later in a future request.
def parse1(self, response):
    items.append(1)

def parse2(self, response):
    items = []
    urls = ['https://www.example1.com', 'https://www.example2.com']
    for url in urls:
        yield Request(
            url,
            callback=self.parse1,
            dont_filter=True
        )
    print items
How can this be achieved?
Meta doesn't help: it passes values into a request, not out of one, and I want to collect values from a loop of requests.
Upvotes: 3
Views: 6921
Reputation: 1462
This is quite possibly the most often encountered issue for newcomers to Scrapy or async programming in general. (So I'll try for a more comprehensive answer.)
What you're trying to do is this:
Response -> Response -> Response
    |  <-----------------------'
    |          \-> Response
    |  <-----------------------'
    |          \-> Response
    |  <-----------------------'
aggregating    \-> Response
    V
 Data out
Whereas what you really have to do in async programming is this chaining of your responses / callbacks:
Response -> Response -> Response -> Response ::> Data out to ItemPipeline (Exporters)
       \-> Response -> Response -> Response ::> Data out to ItemPipeline
                    \-> Response -> Response ::> Data out to ItemPipeline
                                 \-> Response ::> Error
So what's needed is a paradigm shift in thinking on how to aggregate your data.
Think of the code flow as a timeline; you can't go back in time - or return a result back in time - only forward.
You can only get the promise of some future work to be done, at the time you schedule it.
So the clever way is to forward yourself the data that you'll be needing at that future point in time.
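For example (a minimal sketch, not the asker's code; the callback names and XPaths here are made up), forwarding the data you already have to the callback that will need it could look like this:
def parse_listing(self, response):
    partial = {'listing_url': response.url}  # data we already know right now
    for href in response.xpath('//a[@class="detail"]/@href').extract():
        # send the data forward in time, attached to the request that will need it
        # (copied per request, so concurrent callbacks don't share one dict)
        yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_detail,
                             meta={'partial': dict(partial)})

def parse_detail(self, response):
    partial = response.meta['partial']  # the forwarded data arrives with the response
    partial['detail'] = response.xpath('//h1/text()').extract_first()
    yield partial  # the completed item goes out to the pipelines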
The major problem, I think, is that this feels and looks awkward in Python, whereas it looks much more natural in languages like JavaScript, even though it's essentially the same thing.
And that may be even more so the case in Scrapy, because it tries to hide the complexity of Twisted's deferreds from its users.
But you should see some similarities in the following representations:
Random JS example:
new Promise(function(resolve, reject) {    // code flow
    setTimeout(() => resolve(1), 1000);    //  |
}).then(function(result) {                 //  v
    alert(result);                         //  |
    return result * 2;                     //  |
}).then(function(result) {                 //  |
    alert(result);                         //  |
    return result * 2;                     //  v
});
Style of Twisted deferreds:
(Source: https://twistedmatrix.com/documents/16.2.0/core/howto/defer.html#visual-explanation)
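(The linked page shows this as a diagram; as a rough stand-in, here is a small hand-written sketch of chaining plain Twisted deferreds, analogous to the JS example above - not the example from the Twisted docs.)
from twisted.internet import defer, reactor

def cb1(result):               # code flow
    print(result)              #  |
    return result * 2          #  |

def cb2(result):               #  v
    print(result)              #  |
    return result * 2          #  |

d = defer.Deferred()
d.addCallback(cb1)             # chain the callbacks,
d.addCallback(cb2)             # just like .then().then()
reactor.callLater(1, d.callback, 1)   # fire the chain with the value 1 after one second
reactor.callLater(2, reactor.stop)
reactor.run()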
Style in Scrapy Spider callbacks:
scrapy.Request(url,
               callback=self.parse,  # > go to next response callback
               errback=self.erred)   # > go to custom error callback
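(self.erred isn't defined in that snippet; a minimal errback could look something like this - a sketch only, with the method name taken from the snippet above.)
from scrapy.spidermiddlewares.httperror import HttpError

def erred(self, failure):
    # failure is a twisted.python.failure.Failure wrapping the original exception
    self.logger.error(repr(failure))
    if failure.check(HttpError):
        # for HTTP errors, the failed response is still available here
        self.logger.error('HttpError on %s', failure.value.response.url)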
So where does that leave us with Scrapy?
Pass your data along as you go, don't hoard it ;)
This should be sufficient in almost every case, except where you have no choice but to merge Item information from multiple pages, and where those Requests can't be serialized into the following schema (more on that later).
->- flow of data ---->---------------------->
Response -> Response
       `-> Data -> Req/Response
           Data       `-> MoreData -> Yield Item to ItemPipeline (Exporters)
           Data -> Req/Response
                      `-> MoreData -> Yield Item to ItemPipeline
 1. Gen    2. Gen         3. Gen
How you implement this model in code will depend on your use-case.
Scrapy provides the meta field in Requests/Responses for slugging along data.
Despite the name it's not really 'meta', but rather quite essential. Don't avoid it, get used to it.
Doing that might seem counterintuitive, heaping along and duplicating all that data into potentially thousands of newly spawned requests; but because of the way Scrapy handles references, it's not actually bad, and old objects get cleaned up early by Scrapy. In the above ASCII art, by the time your 2nd generation requests are all queued up, the 1st generation responses will be freed from memory by Scrapy, and so on. So this isn't really the memory bloat one might think, if used correctly (and you're not handling lots of big files).
An alternative to 'meta' is instance variables (global data): storing stuff in some self.data object or other, and accessing it in the future from your next response callback.
(Never from the old one, since at the time it ran that data did not exist yet.)
When doing this, always remember that it's globally shared data, of course, which might have "parallel" callbacks looking at it.
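A hedged sketch of that approach (the spider and attribute names here are invented), keying the storage by response.url so those "parallel" callbacks don't overwrite each other:
import scrapy

class AggregatingSpider(scrapy.Spider):
    name = 'aggregating'
    start_urls = ['https://example.com/']
    data = {}  # shared storage, visible to every callback of this spider

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # each callback writes under its own key, so concurrent callbacks don't clash
        self.data[response.url] = response.xpath('//title/text()').extract_first()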
And then finally, sometimes one might even use external sources, like Redis queues or sockets, to communicate data between the Spider and a datastore (for example to pre-fill the start_urls).
And how could this look in code?
You can write "recursive" parse methods (actually just funnel all responses through the same callback method):
def parse(self, response):
    next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(response.urljoin(next_page_url))  # will "recurse" back to parse()
    if 'some_data' in response.body:
        yield {  # the simplest item is a dict
            'statuscode': response.status,
            'data': response.body,
        }
or you can split between multiple parse methods, each handling a specific type of page/Response:
def parse(self, response):
    next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page_url:
        request = scrapy.Request(response.urljoin(next_page_url))
        request.callback = self.parse2  # will go to parse2()
        request.meta['data'] = {'source': response.url}  # seed whatever we want to pass along
        yield request

def parse2(self, response):
    data = response.meta.get('data')
    # add some more data
    data['more_data'] = response.xpath('//whatever/we/@found').extract()
    data['found_links'] = response.xpath('//a/@href').extract()
    # yield some more requests
    for url in data['found_links']:
        request = scrapy.Request(response.urljoin(url), callback=self.parse3)
        request.meta['data'] = data  # and keep on passing it along
        yield request

def parse3(self, response):
    data = response.meta.get('data')
    # ...workworkwork...
    # finally, drop stuff to the item-pipelines
    yield data
Or even combine it like this:
def parse(self, response):
    data = response.meta.get('data', None)
    if not data:  # we are on our first request
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            request = scrapy.Request(response.urljoin(next_page_url))
            request.callback = self.parse  # will "recurse" back to parse()
            request.meta['data'] = {'found_links': response.xpath('//a/@href').extract()}
            yield request
        return  # stop here
    # else: we already got data, continue with something else
    for url in data['found_links']:
        request = scrapy.Request(response.urljoin(url), callback=self.parse3)
        request.meta['data'] = data  # and keep on passing it along
        yield request
But this REALLY isn't good enough for my case!
Finally, one can consider these more complex approaches, to handle flow control, so those pesky async calls become predictable:
Force serialization of interdependent requests, by changing the request flow:
def start_requests(self):
    url = 'https://example.com/final'
    request = scrapy.Request(url, callback=self.parse1)
    request.meta['urls'] = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]
    yield request

def parse1(self, response):
    urls = response.meta.get('urls')
    data = response.meta.get('data')
    if not data:
        data = {}
    # process page response somehow
    page = response.xpath('//body').extract()
    # and remember it
    data[response.url] = page
    # keep unrolling urls
    try:
        url = urls.pop()
        request = scrapy.Request(url, callback=self.parse1)  # recurse
        request.meta['urls'] = urls  # pass along
        request.meta['data'] = data  # to next stage
        return request
    except IndexError:  # list is empty
        # aggregate data somehow
        item = {}
        for url, stuff in data.items():
            item[url] = stuff
        return item
Another option for this is scrapy-inline-requests, but be aware of its downsides as well (read the project README).
from inline_requests import inline_requests

@inline_requests
def parse(self, response):
    urls = [response.url]
    for i in range(10):
        next_url = response.urljoin('?page=%d' % i)
        try:
            next_resp = yield Request(next_url, meta={'handle_httpstatus_all': True})
            urls.append(next_resp.url)
        except Exception:
            self.logger.info("Failed request %s", i, exc_info=True)
    yield {'urls': urls}
Aggregate data in instance storage ("global data") and handle flow control through either or both of custom Request priorities and pydispatch signals for "out-of-band" notifications. Priorities can enforce an order on the responses; signals are not really lightweight, but they're a whole different layer for handling events and notifications.
This is a simple way to use custom Request priorities:
custom_settings = {
    'CONCURRENT_REQUESTS': 1,
}
data = {}

def parse1(self, response):
    # prioritize these next requests over everything else
    urls = response.xpath('//a/@href').extract()
    for url in urls:
        yield scrapy.Request(url,
                             priority=900,
                             callback=self.parse2,
                             meta={})
    final_url = 'https://final'
    yield scrapy.Request(final_url, callback=self.parse3)

def parse2(self, response):
    # handle prioritized requests
    data = response.xpath('//what/we/need/text()').extract()
    self.data.update({response.url: data})

def parse3(self, response):
    # collect data, other requests will have finished by now
    # IF THE CONCURRENCY IS LIMITED, otherwise there is no guarantee
    return self.data
And a basic example using signals.
It listens to the internal idle event, fired when the Spider has crawled all requests and is sitting pretty, and uses it for doing last-second cleanup (in this case, aggregating our data). We can be absolutely certain that we won't be missing out on any data at this point.
import scrapy
from scrapy import Spider, signals
from scrapy.exceptions import DontCloseSpider


class SignalsSpider(Spider):

    data = {}
    ima_done_now = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SignalsSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle, signal=signals.spider_idle)
        return spider

    def idle(self, spider):
        if self.ima_done_now:
            return
        self.crawler.engine.schedule(self.finalize_crawl(), spider)
        raise DontCloseSpider

    def finalize_crawl(self):
        self.ima_done_now = True
        # aggregate data and finish
        item = self.data
        return item

    def parse(self, response):
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse2)

    def parse2(self, response):
        # handle requests
        data = response.xpath('//what/we/need/text()').extract()
        self.data.update({response.url: data})
A final possibility is using external sources like message-queues or redis, as already mentioned, to control the spider flow from outside. And that covers all the ways I can think of.
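For illustration only, a sketch of seeding a spider from a Redis list; this assumes the redis-py package and a list key named 'myspider:start_urls', neither of which is prescribed above:
import redis
import scrapy

class RedisSeededSpider(scrapy.Spider):
    name = 'redis_seeded'

    def start_requests(self):
        r = redis.Redis(host='localhost', port=6379)
        # pop URLs that some external process pushed into the queue
        while True:
            url = r.lpop('myspider:start_urls')
            if url is None:
                break
            yield scrapy.Request(url.decode('utf-8'), callback=self.parse)

    def parse(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}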
Once an Item is yielded/returned to the Engine, it will be passed to the ItemPipelines (which can make use of Exporters - not to be confused with FeedExporters), where you can continue to massage the data outside the Spider.
A custom ItemPipeline implementation might store the items in a database, or do any number of exotic processing things on them.
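A small hedged sketch of such a pipeline (the class name is a placeholder; it would be enabled via the ITEM_PIPELINES setting):
# pipelines.py
class StoreItemsPipeline(object):

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # massage, validate or persist the item here
        self.items.append(item)
        return item  # pass it on to the next pipeline / exporter

    def close_spider(self, spider):
        spider.logger.info('collected %d items', len(self.items))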
Hope this helps.
(And feel free to edit this with better text or examples, or fix any errors there may be.)
Upvotes: 24
Reputation: 21436
If I understand you correctly, what you want is a while chain.
Pseudo code:
queue = get_queue()
items = []
while queue is not empty:
    items.append(crawl1())
crawl2(items)
In Scrapy this is a bit ugly but not difficult:
default_queue = ['url1', 'url2']

def parse(self, response):
    queue = response.meta.get('queue', self.default_queue)
    items = response.meta.get('items', [])
    if not queue:
        yield Request(make_url_from_items(items), self.parse_items)
        return
    url = queue.pop()
    item = {
        # make item from response
    }
    items.append(item)
    yield Request(url, meta={'queue': queue, 'items': items})
This will loop through parse until queue is empty, and then yield a new request made from the results. It should be noted that this becomes a synchronous chain; however, if you have multiple start_urls you'd still have an async spider that just has multiple synchronous chains :)
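For example (a sketch with made-up queues), each start request can carry its own queue, giving chains that are synchronous internally but run concurrently with each other:
def start_requests(self):
    queues = [
        ['https://example.com/a1', 'https://example.com/a2'],
        ['https://example.com/b1', 'https://example.com/b2'],
    ]
    for queue in queues:
        url = queue.pop()
        # the default callback is self.parse, which keeps unrolling this chain
        yield Request(url, meta={'queue': queue, 'items': []})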
Upvotes: 0