Scrapy: Returning list of ids after crawling

Question

I wrote a custom spider to recursively scape pages of a website and store the details of each crawl in my postgres database:

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self):
        self.start_urls = ['http://www.example.com']

    def parse(self, response):
        yield scrapy.request(self.start_urls[0], callback=self.parse_page)

    def parse_page(self, response):
        with transaction.manager:
            crawl = Crawl()
            crawl.url = response.request.url
            crawl.response_body = response.body
            Session.add(crawl)
            Session.flush()

        if len(response.css('.pager-next')) == 1:
            # build url for the next page to crawl
            # ...
            yield scrapy.Request(url=full_url, callback=self.parse_page)

The problem is that I want to get back a list of ids for the crawls that were added to the database, which another function can use.

def scrape_website():
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start() # <-- how to return crawl ids?

    parse_crawls(crawl_ids)

Any ideas?

Scrapy: Returning list of ids after crawling

Answers (1)

Related Questions