Maciek

Reputation: 1982

Scrapy/Python getting items from yield requests

I am trying to request multiple pages and store a variable returned from the callback in a list that will be used later in a future request.

def parse1(self,response):
    items.append(1)

def parse2(self,response):
    items=[]
    urls=['https://www.example1.com','https://www.example2.com']
    for url in urls:
        yield Request(
            url,
            callback=self.parse1,
            dont_filter=True
        )
    print items

How can this be achieved?

Meta doesn't help: it passes values into a request, not out of it, and I want to collect values from a loop of requests.

Upvotes: 3

Views: 6921

Answers (2)

nyov

Reputation: 1462

This is quite possibly the most often encountered issue for newcomers to Scrapy or async programming in general. (So I'll try for a more comprehensive answer.)

What you're trying to do is this:

Response -> Response -> Response
   | <-----------------------'
   |                \-> Response
   | <-----------------------'
   |                \-> Response
   | <-----------------------'
aggregating         \-> Response
   V 
  Data out 

Whereas what you really have to do in async programming is chain your responses / callbacks:

Response -> Response -> Response -> Response ::> Data out to ItemPipeline (Exporters)
        \-> Response -> Response -> Response ::> Data out to ItemPipeline
                    \-> Response -> Response ::> Data out to ItemPipeline
                     \> Response ::> Error

So what's needed is a paradigm shift in thinking on how to aggregate your data.

Think of the code flow as a timeline; you can't go back in time - or return a result back in time - only forward. You can only get the promise of some future work to be done, at the time you schedule it.
So the clever way is to forward yourself the data that you'll be needing at that future point in time.

The major problem, I think, is that this feels and looks awkward in Python, whereas it looks much more natural in languages like JavaScript, even though it's essentially the same thing.

And that may be even more so the case in Scrapy, because it tries to hide this complexity of Twisted's deferreds from users.

But you should see some similarities in the following representations:


  • Random JS example:

    new Promise(function(resolve, reject) { // code flow
      setTimeout(() => resolve(1), 1000);   //  |
    }).then(function(result) {              //  v
      alert(result);                        //  |
      return result * 2;                    //  |
    }).then(function(result) {              //  |
      alert(result);                        //  |
      return result * 2;                    //  v
    });
    
  • Style of Twisted deferreds (see the minimal chaining sketch after this list):

    [Image: Twisted deferreds, visual explanation]
    (Source: https://twistedmatrix.com/documents/16.2.0/core/howto/defer.html#visual-explanation)

  • Style in Scrapy Spider callbacks:

    scrapy.Request(url,
                   callback=self.parse, # > go to next response callback
                   errback=self.erred)  # > go to custom error callback
    

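For the Twisted deferreds item above, here is a minimal chaining sketch in plain Twisted (not Scrapy-specific), mirroring the JS promise example:

from twisted.internet import defer

def step(result):
    print(result)      # code flow, like the JS .then() handlers above
    return result * 2  # becomes the input of the next callback in the chain

d = defer.Deferred()
d.addCallback(step)
d.addCallback(step)
d.callback(1)  # fire the chain: prints 1, then 2
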
So where does that leave us with Scrapy?

Pass your data along as you go, don't hoard it ;)
This should be sufficient in almost every case, except where you have no choice but to merge Item information from multiple pages whose Requests can't be serialized into the following schema (more on that later).

->- flow of data ---->---------------------->
Response -> Response
           `-> Data -> Req/Response 
               Data    `-> MoreData -> Yield Item to ItemPipeline (Exporters)
               Data -> Req/Response
                       `-> MoreData -> Yield Item to ItemPipeline
 1. Gen      2. Gen        3. Gen

How you implement this model in code will depend on your use-case.

Scrapy provides the meta field in Requests/Responses for lugging data along. Despite the name it's not really 'meta', but rather quite essential. Don't avoid it; get used to it.

Doing that might seem counterintuitive, heaping up and duplicating all that data into potentially thousands of newly spawned requests; but because of the way Scrapy handles references, it's not actually bad, and old objects get cleaned up early by Scrapy. In the above ASCII art, by the time your 2nd-generation requests are all queued up, the 1st-generation responses will already have been freed from memory by Scrapy, and so on. So this isn't really the memory bloat one might think, if used correctly (and you're not handling lots of big files).
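
A minimal sketch of passing data along in meta (the detail-page path and selectors here are made up for illustration):

def parse(self, response):
    item = {'title': response.xpath('//h1/text()').extract_first()}
    # forward the partial item to the callback that will finish it
    yield scrapy.Request(response.urljoin('/details'),  # hypothetical detail page
                         callback=self.parse_details,
                         meta={'item': item})

def parse_details(self, response):
    item = response.meta['item']  # pick the forwarded data back up
    item['details'] = response.xpath('//p/text()').extract()
    yield item  # off to the ItemPipelines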

An alternative to 'meta' are instance variables ("global" data): store stuff in some self.data attribute or other, and access it later from a future response callback. (Never from an earlier one, since at that point the data did not exist yet.) When doing this, always remember that it is globally shared data, which "parallel" callbacks might be looking at.

And then, finally, sometimes one might even use external sources, like Redis queues or sockets, to communicate data between the Spider and a datastore (for example to pre-fill the start_urls).

And how could this look in code?

You can write "recursive" parse methods (actually just funnel all responses through the same callback method):

def parse(self, response):
    next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(response.urljoin(next_page_url)) # will "recurse" back to parse()

    if 'some_data' in response.body:
        yield { # the simplest item is a dict
            'statuscode': response.status,
            'data': response.body,
        }

or you can split between multiple parse methods, each handling a specific type of page/Response:

def parse(self, response):
    next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page_url:
        request = scrapy.Request(response.urljoin(next_page_url))
        request.callback = self.parse2 # will go to parse2()
        request.meta['data'] = {'whatever': 'we need later'} # a dict, so parse2() can add to it
        yield request

def parse2(self, response):
    data = response.meta.get('data')
    # add some more data
    data['more_data'] = response.xpath('//whatever/we/@found').extract()
    data['found_links'] = response.xpath('//a/@href').extract()
    # yield some more requests
    for url in data['found_links']:
        request = scrapy.Request(url, callback=self.parse3)
        request.meta['data'] = data # and keep on passing it along
        yield request

def parse3(self, response):
    data = response.meta.get('data')
    # ...workworkwork...
    # finally, drop stuff to the item-pipelines
    yield data

Or even combine it like this:

def parse(self, response):
    data = response.meta.get('data', None)
    if not data: # we are on our first request
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            request = scrapy.Request(response.urljoin(next_page_url))
            request.callback = self.parse # will "recurse" back to parse()
            request.meta['data'] = {'found_links': response.xpath('//a/@href').extract()}
            yield request
        return # stop here
    # else: we already got data, continue with something else
    for url in data['found_links']:
        request = scrapy.Request(url, callback=self.parse3)
        request.meta['data'] = data # and keep on passing it along
        yield request

But this REALLY isn't good enough for my case!

Finally, one can consider these more complex approaches, to handle flow control, so those pesky async calls become predictable:

Force serialization of interdependent requests, by changing the request flow:

def start_requests(self):
    url = 'https://example.com/final'
    request = scrapy.Request(url, callback=self.parse1)
    request.meta['urls'] = [ 
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]   
    yield request

def parse1(self, response):
    urls = response.meta.get('urls')
    data = response.meta.get('data')
    if not data:
        data = {}
    # process page response somehow
    page = response.xpath('//body').extract()
    # and remember it
    data[response.url] = page

    # keep unrolling urls
    try:
        url = urls.pop()
        request = scrapy.Request(url, callback=self.parse1) # recurse
        request.meta['urls'] = urls # pass along
        request.meta['data'] = data # to next stage
        return request
    except IndexError: # list is empty
        # aggregate data somehow
        item = {}
        for url, stuff in data.items():
            item[url] = stuff
        return item

Another option for this is scrapy-inline-requests, but be aware of its downsides as well (read the project README).

@inline_requests
def parse(self, response):
    urls = [response.url]
    for i in range(10):
        next_url = response.urljoin('?page=%d' % i)
        try:
            next_resp = yield Request(next_url, meta={'handle_httpstatus_all': True})
            urls.append(next_resp.url)
        except Exception:
            self.logger.info("Failed request %s", i, exc_info=True)

    yield {'urls': urls}

Aggregate data in instance storage ("global data") and handle flow control through either or both of the following:

  • Scheduler request priorities to enforce order of responses, so we can hope that by the time the last Request is processed, everything lower-prio has finished.
  • Custom pydispatch signals for "out-of-band" notifications. While those are not really lightweight, they're a whole different layer to handle events and notifications.

This is a simple way to use custom Request priorities:

custom_settings = {
    'CONCURRENT_REQUESTS': 1,
}   
data = {}

def parse1(self, response):
    # prioritize these next requests over everything else
    urls = response.xpath('//a/@href').extract()
    for url in urls:
        yield scrapy.Request(url,
                             priority=900,
                             callback=self.parse2,
                             meta={})
    final_url = 'https://final'
    yield scrapy.Request(final_url, callback=self.parse3)

def parse2(self, response):
    # handle prioritized requests
    data = response.xpath('//what/we/need/text()').extract()
    self.data.update({response.url: data})

def parse3(self, response):
    # collect data, other requests will have finished by now
    # IF THE CONCURRENCY IS LIMITED, otherwise no guarantee
    return self.data

And a basic example using signals.
This listens for the internal spider_idle event, fired when the Spider has crawled all requests and is sitting idle, and uses it to do last-second cleanup (in this case, aggregating our data). We can be absolutely certain that we won't be missing out on any data at this point.

import scrapy
from scrapy import Spider, signals
from scrapy.exceptions import DontCloseSpider

class SignalsSpider(Spider):

    data = {}
    ima_done_now = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SignalsSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle, signal=signals.spider_idle)
        return spider

    def idle(self, spider):
        if self.ima_done_now:
            return
        self.crawler.engine.schedule(self.finalize_crawl(), spider)
        raise DontCloseSpider

    def finalize_crawl(self):
        self.ima_done_now = True
        # aggregate data and finish
        item = self.data
        return item 

    def parse(self, response):
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse2)

    def parse2(self, response):
        # handle requests
        data = response.xpath('//what/we/need/text()').extract()
        self.data.update({response.url: data})

A final possibility is using external sources like message queues or Redis, as already mentioned, to control the spider flow from outside. And that covers all the ways I can think of.
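
As a rough illustration of the Redis idea, a minimal sketch (it assumes a locally running Redis with a list under the made-up key 'start_urls', and uses the plain redis-py client, not any Scrapy-specific integration):

import redis
import scrapy

class RedisSeededSpider(scrapy.Spider):
    name = 'redis_seeded'  # example name

    def start_requests(self):
        # pre-fill the start URLs from an external Redis list,
        # so another process can control what this spider crawls
        r = redis.Redis(host='localhost', port=6379)
        for url in r.lrange('start_urls', 0, -1):
            yield scrapy.Request(url.decode('utf-8'), callback=self.parse)

    def parse(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}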

Once an Item is yielded/returned to the Engine, it will be passed to the ItemPipelines (which can make use of Exporters - not to be confused with FeedExporters), where you can continue to massage the data outside the Spider. A custom ItemPipeline implementation might store the items in a database, or do any number of exotic processing things on them.
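
For example, a minimal ItemPipeline sketch (the project and class names are made up; it would be enabled through the ITEM_PIPELINES setting):

# pipelines.py -- enable with e.g.
# ITEM_PIPELINES = {'myproject.pipelines.CollectToJsonPipeline': 300}
import json

class CollectToJsonPipeline(object):

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # massage/collect the data outside the Spider
        self.items.append(dict(item))
        return item  # hand the item on to any later pipelines

    def close_spider(self, spider):
        with open('%s_items.json' % spider.name, 'w') as f:
            json.dump(self.items, f)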

Hope this helps.

(And feel free to edit this with better text or examples, or fix any errors there may be.)

Upvotes: 24

Granitosaurus

Reputation: 21436

If I understand you correctly, what you want is a while-loop chain:

  1. Have some urls
  2. Crawl all of those urls to form some data
  3. Make new request using that data

Pseudo code:

queue = get_queue()
items = []
while queue is not empty:
    items.append(crawl1())
crawl2(items)

In Scrapy this is a bit ugly but not difficult:

default_queue = ['url1', 'url2']
def parse(self, response):
    queue = response.meta.get('queue', self.default_queue)
    items = response.meta.get('items', [])
    if not queue:
        yield Request(make_url_from_items(items), self.parse_items)
        return
    url = queue.pop()
    item = {
        # make item from response
    }
    items.append(item)
    # no callback given, so this defaults back to self.parse
    yield Request(url, meta={'queue': queue, 'items': items})

This will loop parse until the queue is empty, and then yield a new request made from the results. Note that this becomes a synchronous chain; however, if you have multiple start_urls you'd still have an async spider that just has multiple synchronous chains :)
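
A minimal sketch of seeding several such chains (the start URLs and queues are made up for illustration), building on the parse() above:

def start_requests(self):
    # each start request seeds its own queue, so you get several
    # independent synchronous chains that still run concurrently
    chains = {
        'https://example.com/a': ['https://example.com/a1', 'https://example.com/a2'],
        'https://example.com/b': ['https://example.com/b1', 'https://example.com/b2'],
    }
    for start_url, queue in chains.items():
        yield Request(start_url, meta={'queue': queue, 'items': []})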

Upvotes: 0
