Tony Lâmpada

Reputation: 5459

Scrapy: Wait for some urls to be parsed, then do something

I have a spider that needs to find product prices. Those products are grouped together in batches (coming from a database) and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes. So I have something like:

class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          # execute before the last product
                          # url is scraped, right?

    def parse(self, response):
        #...

The problem here is that, due to the async nature of Scrapy, the second status update on the batch object is going to run too soon... right? Is there a way to group these requests together somehow and have the batch object be updated when the last one is parsed?

Upvotes: 0

Views: 1242

Answers (4)

Nguyen Duc Tien

Reputation: 3

This is my code. Two parser functions call the same AfterParserFinished(), which counts the number of invocations to determine when all parsers have finished.

countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers += 1
    print(self.countAccomplishedParsers)  # how many parsers have finished so far
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")


def parse1(self, response):
    self.AfterParserFinished()

def parse2(self, response):
    self.AfterParserFinished()

Upvotes: 0

Tony Lâmpada

Reputation: 5459

I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:

class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            counter = {'curr': 0, 'total': len(products)}  # the counter dictionary 
                                                           # for this batch
            for prod in products:
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod, 
                                           'batch': batch, 
                                           'counter': counter})
                                     # trick = add the counter in the meta dict

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        self.increment_counter(batch, counter) # increment counter only after 
                                               # the work is done

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...

This works fine as long as all the Requests yielded by start_requests have different URLs.

If there are any duplicates, scrapy will filter them out and not call your parse method, so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.

As it turns out you can override scrapy's behaviour for duplicates.

First, we need to change settings.py to specify an alternative "duplicates filter" class:

DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'

Then we create the MyDupeFilter class, that lets the spider know when there is a duplicate:

from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)

Then we modify our spider to make it increment our counter when a duplicate is found:

class PriceSpider(scrapy.Spider):
    name = 'prices'

    #...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)

And we are good to go

Upvotes: 0

Umair Ayub

Reputation: 21201

Here is the trick:

With each request, send batch_id, total_products_in_this_batch and processed_this_batch,

so that anywhere in any function you can check them:

for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check total number of products in this batch and assign to `total_products_in_this_batch`

    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(),
                             meta={'prod': prod,
                                   'batch_id': batch_id,
                                   'total_products_in_this_batch': total_products_in_this_batch,
                                   'processed_this_batch': processed_this_batch})

Then, anywhere in the code, for any particular batch, check if processed_this_batch == total_products_in_this_batch and, if so, save the batch.
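
A minimal sketch of what that parse-side check could look like, assuming the meta keys from the snippet above; the Batches.objects.get(id=...) lookup by batch_id is an assumption, mirroring the Batches.objects.all() call from the question:

def parse(self, response):
    # ... process the response as usual ...

    # the counters travel with each request in response.meta
    if response.meta['processed_this_batch'] == response.meta['total_products_in_this_batch']:
        # assumed lookup: fetch the batch by the id sent in meta and mark it DONE
        batch = Batches.objects.get(id=response.meta['batch_id'])
        batch.status = 'DONE'
        batch.finished_on = datetime.now()
        batch.save()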

Upvotes: 2

Rafael Almeida

Reputation: 5240

For this kind of thing you can use the spider_closed signal, to which you can bind a function that runs when the spider is done crawling.
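
A minimal sketch of wiring that signal up, using Scrapy's documented from_crawler / signals.connect pattern; what the handler actually does (e.g. closing out the batches) is left as a placeholder:

from scrapy import signals
import scrapy

class PriceSpider(scrapy.Spider):
    name = 'prices'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PriceSpider, cls).from_crawler(crawler, *args, **kwargs)
        # bind our handler to the spider_closed signal
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # runs once, after the spider has finished crawling;
        # e.g. mark any still-RUNNING batches as DONE here
        pass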

Upvotes: 1
