dfriestedt

Reputation: 493

Scrapy spider_idle call to restart scrape

I have a scrape set up in Scrapy that targets 1M unique URLs in a numeric sequence. For example: http://www.foo.com/PIN=000000000001

I hold the PINs in a DB. Rather than loading 1M PINs into memory and creating 1M start_urls, I'm using the start_requests() function to query the DB for 5000 PINs at a time. After completing those 5000 unique URLs I want to restart the scrape and keep doing so until all 1M URLs are scraped. In the Scrapy user group they recommended I use the spider_idle function to keep restarting the scrape. I have the signal hooked up correctly per the code below, but I cannot seem to find the right code to restart the scrape. See below:

from scrapy import Request, Spider, signals
from scrapy.xlib.pydispatch import dispatcher


class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def __init__(self, *args, **kwargs):
        super(Foo, self).__init__(*args, **kwargs)
        dispatcher.connect(self.spider_idle, signals.spider_idle)

    def spider_idle(self, spider):
        print('idle function called')  # this prints, so I know the handler is being called
        self.start_requests()  # this does not restart the scrape

    def start_requests(self):
        data = self.coll.find({'status': 'unscraped'}).limit(5000)

        for row in data:
            pin = row['pin']
            url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
            yield Request(url, meta={'pin': pin})

What code do I need to restart the scrape?
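(For reference, the pattern usually suggested for this is to have the spider_idle handler schedule the next batch through the crawler engine and raise DontCloseSpider, rather than calling start_requests() directly; requests yielded from a plain method call are never handed to the scheduler. A minimal sketch of such a handler, assuming an older Scrapy API where crawler.engine.crawl() accepts the spider as a second argument and the same self.coll collection as in the code above:)

    from scrapy import Request
    from scrapy.exceptions import DontCloseSpider

    def spider_idle(self, spider):
        # Fetch the next batch of unscraped PINs (self.coll as in the question).
        data = list(self.coll.find({'status': 'unscraped'}).limit(5000))
        if not data:
            return  # nothing left; let the spider close normally

        for row in data:
            pin = row['pin']
            url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
            # Schedule directly on the engine; requests merely yielded here would be ignored.
            self.crawler.engine.crawl(Request(url, meta={'pin': pin}), spider)

        # Keep the spider open so the newly scheduled requests can run.
        raise DontCloseSpider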

Upvotes: 3

Views: 1451

Answers (1)

alecxe

Reputation: 473763

Instead of restarting the spider, I would query the database for unscraped items until there is nothing left:

from scrapy import Request, Spider


class Foo(Spider):
    name = 'foo'
    allowed_domains = ['foo.com']

    def start_requests(self):
        while True:
            # Materialize the cursor so the emptiness check below works;
            # a raw pymongo cursor is always truthy.
            data = list(self.coll.find({'status': 'unscraped'}).limit(5000))

            if not data:
                break

            for row in data:
                pin = row['pin']
                url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
                yield Request(url, meta={'pin': pin})

You would probably need to implement real pagination over the collection with limits and offsets.
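One possible sketch of that pagination, using pymongo's skip()/limit(); the batch size of 5000, the 'status' field, and the self.coll handle are assumptions carried over from the question:

    from scrapy import Request, Spider


    class Foo(Spider):
        name = 'foo'
        allowed_domains = ['foo.com']

        batch_size = 5000

        def start_requests(self):
            offset = 0
            while True:
                # Page through the collection with an explicit offset;
                # sorting by _id keeps the pages stable between queries.
                batch = list(
                    self.coll.find({'status': 'unscraped'})
                        .sort('_id', 1)
                        .skip(offset)
                        .limit(self.batch_size)
                )
                if not batch:
                    break

                for row in batch:
                    pin = row['pin']
                    url = 'http://foo.com/Pages/PIN-Results.aspx?PIN={}'.format(pin)
                    yield Request(url, meta={'pin': pin})

                offset += self.batch_size

If documents are being flipped to 'scraped' while the crawl is running, paging on the last seen _id instead of a raw offset avoids skipping records as the filtered set shrinks.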

Upvotes: 2
