Reputation: 493
I have a Scrapy spider targeting 1M unique URLs in a numeric sequence, for example: http://www.foo.com/PIN=000000000001
I hold the PINs in a DB. Rather than load all 1M PINs into memory and create 1M start_urls, I'm using the start_requests() method to query the DB for 5000 PINs at a time. After those 5000 unique URLs are done, I want to restart the scrape and keep doing so until all 1M URLs are scraped. In the Scrapy user group they recommended using the spider_idle signal to keep restarting the scrape. I have it wired up as in the code below, but I cannot find the right code to actually restart the scrape:
from scrapy import Spider, Request, signals
from scrapy.xlib.pydispatch import dispatcher  # old-style signal dispatcher


class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def __init__(self, *args, **kwargs):
        super(Foo, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_idle, signals.spider_idle)

    def spider_idle(self, spider):
        print 'idle function called'  # this prints, so I know the handler is being called
        self.start_requests()  # this does not restart the query

    def start_requests(self):
        data = self.coll.find({'status': 'unscraped'}).limit(5000)
        for row in data:
            pin = row['pin']
            url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
            yield Request(url, meta={'pin': pin})
What code do I need to restart the scrape?
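For reference, the usual spider_idle idiom does not call start_requests() by hand (that only builds a new generator that nothing consumes); it schedules a fresh batch through the crawler engine and raises DontCloseSpider. Below is a rough sketch along those lines, reusing the self.coll query from the question and assuming the older engine.crawl(request, spider) signature that matches this Python 2 era code; FooIdleSketch and next_batch() are illustrative names only.

from scrapy import Spider, Request, signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher  # old-style dispatcher, as in the question


class FooIdleSketch(Spider):
    name = 'foo_idle_sketch'
    allowed_domains = ['foo.com']

    def __init__(self, *args, **kwargs):
        super(FooIdleSketch, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_idle, signals.spider_idle)

    def next_batch(self):
        # Same 5000-PIN query as in the question.
        for row in self.coll.find({'status': 'unscraped'}).limit(5000):
            pin = row['pin']
            url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
            yield Request(url, meta={'pin': pin})

    def start_requests(self):
        return self.next_batch()

    def spider_idle(self, spider):
        requests = list(self.next_batch())
        if not requests:
            return  # nothing left unscraped; let the spider close normally
        for request in requests:
            # Older engine.crawl(request, spider) signature; recent Scrapy takes only the request.
            self.crawler.engine.crawl(request, spider)
        raise DontCloseSpider  # keep the spider open for the batch just scheduled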
Upvotes: 3
Views: 1451
Reputation: 473763
Instead of restarting the spider, I would query the database for unscraped items until there is nothing left:
class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def start_requests(self):
        while True:
            # Materialize the batch: a bare pymongo cursor is not reliably
            # truth-testable, so check emptiness on a list.
            data = list(self.coll.find({'status': 'unscraped'}).limit(5000))
            if not data:
                break

            for row in data:
                pin = row['pin']
                url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
                yield Request(url, meta={'pin': pin})
You would probably need to implement real pagination over the collection, e.g. with skip() and limit(), so each pass picks up where the previous one left off.
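A minimal sketch of that pagination, assuming a pymongo collection in self.coll and the field names from the question; it pages in _id order with skip()/limit(). One caveat: if another component flips status to 'scraped' while the crawl runs, the skip offsets will drift, in which case the plain re-query loop above is the safer option.

from scrapy import Request


def start_requests(self):
    # Drop-in replacement for the start_requests() above (sketch only).
    batch_size = 5000
    offset = 0
    while True:
        # Page in stable _id order so consecutive batches do not overlap.
        batch = list(
            self.coll.find({'status': 'unscraped'})
            .sort('_id', 1)
            .skip(offset)
            .limit(batch_size)
        )
        if not batch:
            break
        for row in batch:
            pin = row['pin']
            url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
            yield Request(url, meta={'pin': pin})
        offset += batch_size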
Upvotes: 2