Reputation: 5459
I have a spider that needs to find product prices. Those products are grouped together in batches (coming from a database) and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes.
So I have something like:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          # execute before the last product
                          # url is scraped, right?

    def parse(self, response):
        # ...
def parse(self, response):
#...
The problem here is due to the async nature of scrapy, the second status update on the batch object is going to run too soon... right? Is there a way to group these requests together somehow and have the batch object be updated when the last one is parsed?
Upvotes: 0
Views: 1242
Reputation: 3
This is my code. Two parser functions call the same AfterParserFinished(), which counts the number of invocations to determine when all parsers have finished:
countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers = self.countAccomplishedParsers + 1
    print(self.countAccomplishedParsers)  # how many parsers have finished
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")

def parse1(self, response):
    self.AfterParserFinished()
    pass

def parse2(self, response):
    self.AfterParserFinished()
    pass
Upvotes: 0
Reputation: 5459
I made some adaptations to @Umair's suggestion and came up with a solution that works great for my case:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            # the counter dictionary for this batch
            counter = {'curr': 0, 'total': len(products)}
            for prod in products:
                # trick = add the counter in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        # increment the counter only after the work is done
        self.increment_counter(batch, counter)

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...
This works fine as long as all the Requests yielded by start_requests have different URLs. If there are any duplicates, Scrapy will filter them out and not call your parse method, so you end up with counter['curr'] < counter['total'] and the batch status is left RUNNING forever.
As it turns out, you can override Scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, which lets the spider know when there is a duplicate:
from scrapy.dupefilters import RFPDupeFilter

class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
Then we modify our spider to make it increment our counter when a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go.
Upvotes: 0
Reputation: 21201
Here is the trick: with each request, send batch_id, total_products_in_this_batch and processed_this_batch, and check them anywhere in any function:
for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check total number of products in this batch
    #       and assign to `total_products_in_this_batch`
    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(),
                             meta={'prod': prod,
                                   'batch_id': batch_id,
                                   'total_products_in_this_batch': total_products_in_this_batch,
                                   'processed_this_batch': processed_this_batch})
Then, anywhere in your code, for any particular batch, check if processed_this_batch == total_products_in_this_batch, and if so, save the batch.
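The core idea, finalizing the batch once the processed count reaches the total, can be sketched in plain Python without Scrapy; the dict-based batch and the `process_response` helper below are illustrative stand-ins for the database model and the spider callback:

```python
from datetime import datetime

def process_response(batch, counter):
    """Record one finished response; finalize the batch on the last one."""
    counter['curr'] += 1
    if counter['curr'] == counter['total']:
        batch['status'] = 'DONE'
        batch['finished_on'] = datetime.now()

# simulate a batch of 3 product responses arriving in any order
batch = {'status': 'RUNNING', 'finished_on': None}
counter = {'curr': 0, 'total': 3}
for _ in range(3):
    process_response(batch, counter)
print(batch['status'])  # DONE
```

Because the responses may arrive in any order, the counter must be shared mutable state (here a dict), not a per-request integer snapshot.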
Upvotes: 2
Reputation: 5240
For this kind of deal you can use the spider_closed signal, which lets you bind a function to run when the spider is done crawling.
Upvotes: 1